# Preprocessing csv files with exported responses from Indico-based surveys

## Purpose
- To restructure the response files exported from Indico-based surveys so that their column headers and answer alternatives are consistent with the pre-processed csv files from the Typeform-based surveys, allowing all the post-workshop survey data to be analyzed together.
- To keep the data for new questions as well, so that workshop-specific data can be analyzed.
- To provide a basis for revising the analysis script `survey_analysis.ipynb`.

## Methods
1. Analyze and compare the relevant files with regard to the following:
    - The column headers and the position of each header, and
    - The answer alternatives shown in the exported questions of each survey (in JSON format)
2. For each response file exported from Indico,
    - decide which columns are to be
        - deleted,
        - renamed, and/or
        - moved, and
    - decide how to convert the selected answer alternatives into column/data-entry combinations,
   while keeping the format of "workshop-followup-survey-2018_processed.csv" as far as possible
    
## Results

### Documentation of inconsistency 
`exported_responses/readme.md` describes the differences in the survey questions and answer alternatives between relevant surveys.

### Strategy for pre-processing of each response file
#### Common for all files
Before going on to the further steps, make the following two deletions:
- Delete the following columns:
    - Submitter
    - Submitter Email
    - Submission Date
- Delete the section name (from the first character through the first ":") from each column header.
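
The two deletions above could be sketched with pandas as follows. This is a minimal sketch, not the project's actual script: the helper names `strip_section_name` and `apply_common_deletions` are hypothetical, and it assumes that a section name never contains a ":" itself. Headers beginning with "Unnamed:" (the label pandas gives a column without a header, as seen in the preview in the Discussion) are left untouched so that their ":" is not mistaken for a section separator.

```python
import pandas as pd

# Columns listed above for deletion.
COLUMNS_TO_DROP = ["Submitter", "Submitter Email", "Submission Date"]


def strip_section_name(header):
    """Remove everything up to and including the first ':' from a header."""
    # Assumption: "Unnamed: N" is a pandas placeholder, not a section name.
    if header.startswith("Unnamed:"):
        return header
    _, separator, rest = header.partition(":")
    # Headers without a ':' are returned unchanged.
    return rest.strip() if separator else header


def apply_common_deletions(df):
    """Drop the submitter columns and strip section names from the headers."""
    # errors="ignore" keeps the call safe if a column is already absent.
    trimmed = df.drop(columns=COLUMNS_TO_DROP, errors="ignore")
    return trimmed.rename(columns=strip_section_name)
```

For example, `apply_common_deletions(pd.read_csv("exported_responses/event_39_survey_5.csv"))` would turn "Post-workshop survey: Which workshop did you attend?" into "Which workshop did you attend?".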

All of the following steps assume that the section names have already been deleted from each column header.
- Add a new column "Other" after "What is your current position?" (alternatively, modify the 2017 and 2018 files by setting "What is your current position?" to "Other" for rows that have a value in the "Other" column).
- Add the following new columns: "Reusable", "Reproducible", "Modular", and "Documented".
- Convert the answers to "Would you judge your code to be better reusable/reproducible/modular/documented as a result of attending the workshop?" into "1" or "0" in the columns added in the previous step.
- Convert the "yes"/"no" answers to the following questions to "1"/"0" respectively:
    - Has it become easier for you to collaborate on software development with your colleagues and collaborators?
    - Have you introduced one or more of your colleagues to new tools or practices as a result of the workshop?
- Save the free-text answers to a separate file and delete these columns. The relevant column headers are the following:
    - What else has changed in how you write code for your research after attending a CodeRefinery workshop?
    - Do you have any recommendations on how we should change the CodeRefinery curriculum?
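
The conversion steps above might look roughly like this in pandas. It is a sketch under stated assumptions: the function name `convert_answers` is hypothetical; multiple selections are assumed to be joined in one cell (as in "More reusable; More reproducible" in the preview in the Discussion); and each option is assumed to be recognisable by the keyword "reusable", "reproducible", "modular", or "documented". The exact option labels should be checked against the exported questions before relying on the keyword match.

```python
import pandas as pd

QUALITY_QUESTION = (
    "Would you judge your code to be better reusable/reproducible/"
    "modular/documented as a result of attending the workshop?"
)
YES_NO_QUESTIONS = [
    "Has it become easier for you to collaborate on software development "
    "with your colleagues and collaborators?",
    "Have you introduced one or more of your colleagues to new tools or "
    "practices as a result of the workshop?",
]
FREE_TEXT_QUESTIONS = [
    "What else has changed in how you write code for your research after "
    "attending a CodeRefinery workshop?",
    "Do you have any recommendations on how we should change the "
    "CodeRefinery curriculum?",
]


def convert_answers(df):
    """Return (converted responses, free-text answers) as two DataFrames."""
    out = df.copy()
    # Assumption: a keyword search identifies each selected option.
    selected = out[QUALITY_QUESTION].astype(str).str.lower()
    for column in ["Reusable", "Reproducible", "Modular", "Documented"]:
        hits = selected.str.contains(column.lower(), regex=False, na=False)
        out[column] = hits.astype(int)
    # Anything other than yes/no (including a missing answer) becomes <NA>
    # rather than being guessed.
    for question in YES_NO_QUESTIONS:
        answers = out[question].astype(str).str.strip().str.lower()
        out[question] = answers.map({"yes": 1, "no": 0}).astype("Int64")
    free_text = out[FREE_TEXT_QUESTIONS].copy()
    return out.drop(columns=FREE_TEXT_QUESTIONS), free_text
```

The second return value can then be written to its own file, for example with `free_text.to_csv(...)`, under whatever file name the project chooses.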

#### event_109_survey_24.csv
- Move the column "Would you recommend your colleagues to attend a CodeRefinery workshop?" to the end.

#### event_194_survey_48.csv
- Move the column "Would you recommend your colleagues to attend a CodeRefinery workshop?" to the end.
- Move the column "Participation style" to the end.
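
Moving columns to the end, as described for the two files above, can be done by reordering the column list. The helper name `move_columns_to_end` is hypothetical; the sketch assumes the section names have already been stripped from the headers.

```python
import pandas as pd


def move_columns_to_end(df, columns):
    """Return df with the given columns moved to the end, in the given order."""
    # Columns missing from df are skipped instead of raising an error.
    to_move = [name for name in columns if name in df.columns]
    remaining = [name for name in df.columns if name not in to_move]
    return df[remaining + to_move]
```

For event_194_survey_48.csv the call would be `move_columns_to_end(df, ["Would you recommend your colleagues to attend a CodeRefinery workshop?", "Participation style"])`; for event_109_survey_24.csv only the first header is passed.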


## Discussion
### Items to be discussed: Questions and responses about the workshop's impact on use of introduced tools
The Indico-based surveys changed how respondents answer questions about the impact of the workshop on their practice in research software engineering. The Typeform-based surveys asked about the workshop's impact on the use of each particular tool and let respondents choose one alternative, whereas the Indico-based surveys present a series of questions and let respondents select, for each question, every tool to which it applies. This allows many overlaps, and thus **contradictory responses are sometimes observed**. For example, the first 7 responses (rows) in event_39_survey_5.csv selected "Version control (e.g. Git)" for all of the following questions:

- "Post-workshop survey: Which tools/services have you started using as a result of attending the workshop?"
- "Post-workshop survey: Which tools/services are you using better than before as a result of attending the workshop?"
- "Post-workshop survey: Which tools/services are you using in the same way as before as a result of attending the workshop?"
- "Post-workshop survey: Which tools/services are you not yet using?"

```python
import pandas as pd

dataset_39_5 = pd.read_csv("exported_responses/event_39_survey_5.csv")

# Show the first 8 responses
dataset_39_5.head(n=8)
```

Unnamed: 0,Submitter,Submitter Email,Submission Date,Post-workshop survey: Which workshop did you attend?,Post-workshop survey: What is your current position?,Post-workshop survey: Would you judge your code to be better reusable/reproducible/modular/documented as a result of attending the workshop?,Post-workshop survey: Which tools/services have you started using as a result of attending the workshop?,Post-workshop survey: Which tools/services are you using better than before as a result of attending the workshop?,Post-workshop survey: Which tools/services are you using in the same way as before as a result of attending the workshop?,Post-workshop survey: Which tools/services are you not yet using?,Post-workshop survey: Has it become easier for you to collaborate on software development with your colleagues and collaborators?,Post-workshop survey: Have you introduced one or more of your colleagues to new tools or practices as a result of the workshop?,Post-workshop survey: What else has changed in how you write code for your research after attending a CodeRefinery workshop?,Post-workshop survey: Do you have any recommendations on how we should change the CodeRefinery curriculum?
0,,,2020-12-12 17:05:03.235413+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
1,,,2020-09-19 14:54:15.773532+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
2,,,2020-09-19 11:00:09.523825+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
3,,,2020-05-15 21:40:41.119269+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
4,,,2020-05-07 16:44:50.547252+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
5,,,2019-11-30 09:48:05.057611+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
6,,,2019-11-28 22:32:45.269666+00:00,"Lund, May 2018",Undergraduate student,More reusable,Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Version control (e.g. Git),Yes,Yes,,
7,,,2018-11-08 13:25:56.518467+00:00,"Lund, May 2018",Graduate student,More reusable; More reproducible,None of the above,Version control (e.g. Git),None of the above,Automated testing (e.g Travis CI); Code covera...,Yes,Yes,,


`survey_analysis.ipynb` analyses the processed csv files from the 2017 and 2018 surveys tool by tool, so for each tool the percentages of the four alternatives (started using; using better; using in the same way as before; and not using) sum to 100%. This does not hold for the Indico-based surveys, because the same tool can be selected under several of the questions.

We need to discuss how to solve this problem. Possible solutions are:
- Count the number of responses for each tool/change-in-practice combination and calculate the percentage based on the total number of responses
    - `survey_analysis.ipynb` needs revision, and the resulting figure for this part will look very different.
    - This solution is in a way the most faithful to the obtained data, but it keeps the inconsistent data, which is not ideal.
- Examine inconsistent responses (e.g. the same tool is chosen for >2 out of the 4 relevant questions) and treat them as invalid
    - This solution would yield a dataset closest to those from 2017 and 2018 without any judgment.
    - There is a risk that many responses would be considered invalid.

(any other?)
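
To make the two options above concrete, here is a rough sketch of both. All function names are hypothetical. It assumes that the section names have been stripped from the headers, that multiple selections within a cell are separated by ";", and that "None of the above" should not be counted as a tool. The `threshold` default of 2 mirrors the ">2 out of the 4" example and is open to discussion.

```python
from collections import Counter

import pandas as pd

# Headers as they read after the section name has been stripped.
IMPACT_QUESTIONS = [
    "Which tools/services have you started using as a result of attending the workshop?",
    "Which tools/services are you using better than before as a result of attending the workshop?",
    "Which tools/services are you using in the same way as before as a result of attending the workshop?",
    "Which tools/services are you not yet using?",
]
NOT_A_TOOL = {"None of the above"}


def selected_tools(cell):
    """Split one multi-select cell into a set of tool names."""
    if pd.isna(cell):
        return set()
    # Assumption: options are separated by ";" within a cell.
    parts = {part.strip() for part in str(cell).split(";")}
    return {part for part in parts if part} - NOT_A_TOOL


def percentage_per_combination(df):
    """Percentage of all responses that selected each tool under each question."""
    counts = {}
    for question in IMPACT_QUESTIONS:
        tally = Counter(tool for cell in df[question] for tool in selected_tools(cell))
        counts[question] = pd.Series(tally, dtype="float64")
    # Rows are tools, columns are questions; unselected combinations count as 0.
    return pd.DataFrame(counts).fillna(0) / len(df) * 100


def flag_inconsistent(df, threshold=2):
    """Mark rows where one tool is selected under more than `threshold` questions."""
    def is_inconsistent(row):
        tally = Counter(
            tool for question in IMPACT_QUESTIONS for tool in selected_tools(row[question])
        )
        return any(n > threshold for n in tally.values())

    return df.apply(is_inconsistent, axis=1)
```

With the first option, the percentages for one tool no longer need to sum to 100% across the four questions. With the second, `df[~flag_inconsistent(df)]` would keep only the responses treated as valid.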

### Items to be discussed: How to deal with the different tools shown as answer alternatives 
Because the tools introduced in the workshops have changed, the Indico-based surveys have different answer alternatives for the questions about the workshop's impact on participants' use of tools. To run one analysis on all the pre-processed survey response datasets, we also need to change the files from the Typeform-based surveys in 2017 and 2018. We need a strategy for making the data structure consistent.

- Automated testing:
    - In the Typeform-based surveys in 2017 and 2018, "Automated testing" and "Travis CI" are asked about separately, while in the Indico-based surveys they are combined as "Automated testing (e.g. Travis CI)" or "Automated testing (e.g Travis CI or GitHub Actions or GitLab CI)" (event_194_survey_48.csv).
    - Should the answers given to "Travis CI" be treated as the same as "Automated testing (e.g. Travis CI)" (or equivalent), or should those given to "Automated testing" be?
- Code review:
    - In the Indico-based surveys, "Code review" is changed to "Code review (e.g. via pull requests)".
    - Can all the data in the columns starting with "Code review" be considered answers to the same question?
- CMake, Workflow systems (Snakemake), and Workflow management tools (e.g. Snakemake):
    - Are these about different things, or can they be considered the same (type of) question?
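
Whatever is decided, the outcome could be recorded as a lookup table from survey labels to one agreed name. The mapping below is purely illustrative: every entry is one possible answer to the open questions above, not a decision, and the canonical names and the helper `canonical_tool_name` are hypothetical. "Travis CI" and "CMake" are deliberately left unmapped because their treatment is still open.

```python
# Hypothetical mapping; each entry is an assumption pending the discussion above.
CANONICAL_TOOL_NAMES = {
    "Automated testing (e.g. Travis CI)": "Automated testing",
    "Automated testing (e.g Travis CI or GitHub Actions or GitLab CI)": "Automated testing",
    "Code review (e.g. via pull requests)": "Code review",
    "Workflow systems (Snakemake)": "Workflow management",
    "Workflow management tools (e.g. Snakemake)": "Workflow management",
}


def canonical_tool_name(label):
    """Map a survey label to its agreed name; unknown labels pass through."""
    cleaned = label.strip()
    return CANONICAL_TOOL_NAMES.get(cleaned, cleaned)
```

Keeping the mapping in one place means a later change of mind only requires editing the table, not the analysis code.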

### Opinions 
- NT: Given the inconsistencies in both the survey structures and the answer alternatives for tools between the surveys, it could be preferable to analyse the responses based on the 4 questions about the impact rather than per tool. This requires revising code cells 11-15 of `survey_analysis.ipynb`.
    