## Elixir Ecosystem Survey 2020

This is a quick look at the Elixir Ecosystem Survey 2020. Data was obtained from Hugo Baraúna's GitHub [repo](https://github.com/hugobarauna/elixir-ecosystem-2020-reponses-data).

## About the Survey

The survey was conducted over the summer of 2020 by Brian Cardarella, who led the effort to create and disseminate the [survey](https://elixirforum.com/t/2020-elixir-ecosystem-survey/32396). Brian presented the results at [ElixirConf 2020](https://www.youtube.com/watch?v=-nVgAcy9wB0).

The survey was created on Typeform, but has since been taken down. Hugo kindly extracted the data and also created a PSQL dump of the normalized data. For this analysis, only the raw data extract was used.

### Packages used

In [None]:
library(tidyverse)
library(ggplot2)
library(gridExtra)

### Reading and Cleaning the Data

I'm simply reading the CSV file and setting every column to character type. I could parse the date-time fields (e.g. `Start Date (UTC)`), but I'm not sure it's worth the effort right now.

In [None]:
raw <- read_csv("elixir-ecosystem-survey-2020-raw-data.csv", col_types = cols(.default = "c"))
raw
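
If the date-time fields ever become useful, parsing them could look roughly like the sketch below. It runs on a toy tibble rather than the real extract, and the timestamp format (`%Y-%m-%d %H:%M:%S`) is an assumption that should be checked against the file first.

```r
library(readr)
library(dplyr)

# Toy stand-in for the raw extract; the values and their format are assumptions.
toy <- tibble::tibble(
  `#` = c("a1", "b2"),
  `Start Date (UTC)` = c("2020-06-01 14:03:22", "2020-06-02 09:15:00")
)

# Parse every column whose name contains "UTC" into a date-time (POSIXct, UTC).
toy_parsed <- toy %>%
  mutate(across(contains("UTC"), ~ parse_datetime(.x, format = "%Y-%m-%d %H:%M:%S")))

class(toy_parsed$`Start Date (UTC)`)
```

Values that do not match the format come back as `NA` with a parsing warning, so a mismatch in the assumed format shows up quickly.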

A peculiarity of the data dump is that each row holds the answers from one respondent (i.e. one survey response per row). The column names are either the questions themselves, the individual answer choices (for questions that offered choices), or metadata.

**An important issue with the data is that 14 multiple-choice questions are missing from the column names, although their choices have been preserved.** For now, I've imputed what I believe the original questions were.

In [None]:
additions <- c(
  "What tool do you use to format code?",
  "What CI/CD tool do you use?",
  "How do you deploy Elixir in production?",
  "What platform do you use to deploy your code in production?",
  "What database do you most often use with Elixir?",
  "What tool do you most often use to debug Elixir code?",
  "What were challenges to adopting Elixir for your team?",
  "What benefits did your team experience by using Elixir?",
  "In what capacities is your company using Elixir?",
  "Do you subscribe to any Elixir newsletters?",
  "Do you listen to any Elixir podcasts?",
  "Do you participate in any Elixir forums or communities?",
  "How do you prefer to learn a new language?",
  "How were you first introduced to Elixir?"
)

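I'll convert the data into [long form or "tidy data"](https://vita.had.co.nz/papers/tidy-data.html), which makes the survey responses much easier to analyze. Before pivoting, each imputed question is inserted as an empty placeholder column directly before its block of choice columns.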

In [None]:
long <-
  raw %>%
  mutate(`What tool do you use to format code?` = NA, .after = 162) %>%
  mutate(`What CI/CD tool do you use?` = NA, .after = 150) %>%
  mutate(`How do you deploy Elixir in production?` = NA, .after = 143) %>%
  mutate(`What platform do you use to deploy your code in production?` = NA, .after = 137) %>%
  mutate(`What database do you most often use with Elixir?` = NA, .after = 130) %>%
  mutate(`What tool do you most often use to debug Elixir code?` = NA, .after = 123) %>%
  mutate(`What were challenges to adopting Elixir for your team?` = NA, .after = 112) %>%
  mutate(`What benefits did your team experience by using Elixir?` = NA, .after = 106) %>%
  mutate(`In what capacities is your company using Elixir?` = NA, .after = 92) %>%
  mutate(`Do you subscribe to any Elixir newsletters?` = NA, .after = 82) %>%
  mutate(`Do you listen to any Elixir podcasts?` = NA, .after = 75) %>%
  mutate(`Do you participate in any Elixir forums or communities?` = NA, .after = 66) %>%
  mutate(`How do you prefer to learn a new language?` = NA, .after = 24) %>%
  mutate(`How were you first introduced to Elixir?` = NA, .after = 18) %>%
  pivot_longer(
    !c(`#`, contains("UTC"), `Network ID`),
    names_to = "question",
    values_to = "answer"
  ) %>%
  janitor::clean_names()

#long
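
To make the two steps concrete, here is a minimal sketch on a made-up four-column table. The toy column names, values, and the position `2` are illustrative assumptions, not taken from the real extract.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per respondent, one question column,
# and two choice columns whose parent question is missing.
toy_raw <- tibble::tibble(
  `#` = c("r1", "r2"),
  `Are you using Phoenix?` = c("1", "0"),
  `Elixir Radar` = c("Elixir Radar", NA),
  `ElixirWeekly` = c(NA, "ElixirWeekly")
)

toy_long <- toy_raw %>%
  # Insert the placeholder question after column 2, i.e. right before its choices.
  mutate(`Do you subscribe to any Elixir newsletters?` = NA_character_, .after = 2) %>%
  # Everything except the respondent id becomes a (question, answer) pair.
  pivot_longer(!`#`, names_to = "question", values_to = "answer")

toy_long
```

In the real pipeline the placeholders are added from the highest column position to the lowest, so each insertion leaves the positions used by the later `mutate()` calls unchanged.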

Now that the data is in long form, I separate the questions from the choices. Unfortunately, there is no programmatic way to tell them apart, so I manually created a `questions` table. To make the manual extraction easier, I first created a temporary table, `questions_with_choices`, listing every distinct column name.

In [None]:
questions_with_choices <-
  long %>%
  distinct(question)

# write_csv(questions_with_choices, "./output/questions_with_choices.csv")

# questions_with_choices

In [None]:
qs <- c(
  "Are you actively using Elixir in either professional or personal projects?",
  "What solutions was the Elixir ecosystem lacking?",
  "How long have you been using Elixir?",
  "What is the most recent version of Elixir that you have used?",
  "Have you written any Erlang?",
  "What is your age range?",
  "Which gender do you identify as?",
  "In which country do you currently reside?",
  "Do you have a college degree in Computer Science or similar degree?",
  "What part of Elixir did you find most difficult to learn?",
  "Do you maintain any Open Source (OSS) Elixir libraries?",
  "Have you made contributions to anyone else's OSS Elixir libraries?",
  "Have you made OSS contributions back to Elixir?",
  "How often do you attend local Elixir meetups?",
  "Do you help organize Elixir meetups?",
  "Do you attend your continent's major Elixir Conference",
  "Do you attend any regional Elixir/Erlang conferences?",
  "What industry is your company in?",
  "What is your role within your company?",
  "Does your company use Elixir?",
  "How long has your company been using Elixir?",
  "How many engineers are using Elixir at your company?",
  "Did your company migrate from another language or choose Elixir for a new project?",
  "Can you say which language(s) and describe why it won?",
  "Which operating system do you primarily develop on?",
  "Which editor/IDE do you primarily write Elixir with?",
  "Which operating system do you deploy to?",
  "Have you ever used Hot Code Reloading in production?",
  "If there is one library that you are excited about in 2020 which is it?",
  "Are you using Phoenix?",
  "What is the most recent version of Phoenix that you have used?",
  "Are you running Phoenix in production?",
  "Are you using Nerves?",
  "What is the most recent version of Nerves that you have used?",
  "Are you using Scenic?",
  "Is your Nerves application distributed across many devices?",
  additions
)

questions <- tibble(id = seq_along(qs), question = qs)

# write_csv(questions, "./output/questions.csv")
# questions

An assumption I've made here is that the Typeform export is ordered so that a question's choices immediately follow the question, which lets me relate choices to questions by position. For example, the question "Do you subscribe to any Elixir newsletters?" is immediately followed by "Elixir Radar", "ElixirWeekly", "ElixirDigest", and "Other".

In [None]:
choices <-
  questions_with_choices %>%
  left_join(questions, by = c("question" = "question")) %>%
  fill(id) %>%
  anti_join(questions, by = c("question" = "question")) %>%
  rename(question_id = id) %>%
  rename(choice = question) %>%
  mutate(id = seq_len(nrow(.))) %>%
  select(id, question_id, choice)

# write_csv(choices, "./output/choices.csv")

#choices
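
The positional trick is easier to see on a small example. The sketch below uses made-up `toy_columns` and `toy_questions` tables (hypothetical names) and follows the same join, fill, anti-join sequence.

```r
library(dplyr)
library(tidyr)

# Column names in export order: questions first, each followed by its choices.
toy_columns <- tibble::tibble(
  question = c(
    "Are you using Phoenix?",
    "Do you subscribe to any Elixir newsletters?",
    "Elixir Radar", "ElixirWeekly", "Other"
  )
)

# The manually curated list of real questions.
toy_questions <- tibble::tibble(
  id = 1:2,
  question = c("Are you using Phoenix?", "Do you subscribe to any Elixir newsletters?")
)

toy_choices <- toy_columns %>%
  left_join(toy_questions, by = "question") %>%   # id is NA on choice rows
  fill(id) %>%                                    # carry the preceding question id down
  anti_join(toy_questions, by = "question") %>%   # keep only the choice rows
  rename(question_id = id, choice = question) %>%
  mutate(id = row_number(), .before = 1)

toy_choices
```

All three choices end up with `question_id` 2, because the newsletter question is the nearest question above them. If the export order assumption does not hold for some question, its choices would silently attach to the wrong parent, so a spot check of the result is worthwhile.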

With the data separated into three tables, `long` (the survey responses), `questions`, and `choices`, it's easy to perform basic aggregations and plotting. Here's one example that creates a summary bar plot for each question.

In [None]:
## Helper functions
summarize_survey <- function(df) {
  df %>%
    group_by(answer) %>%
    summarize(n = n()) %>%
    mutate(prop = n / sum(n)) %>%
    arrange(desc(prop)) %>%
    slice_head(n = 15) # takes top n results
}

barplot_survey <- function(df, question) {
  ggplot(df, aes(prop, reorder(answer, prop))) +
    geom_col() +
    theme_bw() +
    ggtitle(question) +
    ylab("") +
    xlab("Proportion (%)") +
    scale_x_continuous(labels = scales::label_percent()) +
    scale_y_discrete(labels = function(x) stringr::str_trunc(x, 40, "right"))
}


In [None]:
survey_plots <-
  long %>%
  inner_join(questions, by = c("question" = "question")) %>%
  filter(!(question %in% additions)) %>%
  replace_na(list(answer = "Did not answer.")) %>%
  mutate(
    answer = ifelse(answer == "1", "Yes", answer),
    answer = ifelse(answer == "0", "No", answer)
  ) %>%
  group_by(question) %>%
  nest() %>%
  mutate(
    summary = map(data, summarize_survey),
    plot = map2(summary, question, barplot_survey)
  )
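
The nest-then-map pattern used here can be tried on a tiny table first. This is a minimal sketch with made-up questions and answers; `toy_summarize` is a hypothetical stand-in for `summarize_survey` without the top-15 cut.

```r
library(dplyr)
library(tidyr)
library(purrr)

toy_long <- tibble::tibble(
  question = c("Q1", "Q1", "Q1", "Q2", "Q2"),
  answer   = c("Yes", "Yes", "No", "macOS", "Linux")
)

# Count each answer and convert the counts to proportions, largest first.
toy_summarize <- function(df) {
  df %>%
    count(answer) %>%
    mutate(prop = n / sum(n)) %>%
    arrange(desc(prop))
}

toy_summaries <- toy_long %>%
  group_by(question) %>%
  nest() %>%                                    # one row per question, answers in `data`
  mutate(summary = map(data, toy_summarize)) %>%
  ungroup()

toy_summaries$summary[[1]]
```

Each row of `toy_summaries` carries its own summary table in a list column, which is what lets the plotting step later map over questions one at a time.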

## Top 15 Results for Survey Responses

In [None]:
# Print the plots
options(warn=-1)
print(survey_plots$plot)