# EXCEED Python Qualtrics Survey - High-Level Analysis

This notebook provides a high-level analysis of the results from the Qualtrics Python Skill Level Survey. The survey aims to objectively assess participants' Python skill levels by placing each participant into one of two categories: _Beginner_/_Novice_ or _Advanced_/_Expert_.

The ultimate goal of the survey is to identify the subset of questions that most effectively differentiates between these two skill levels. The survey content centers on Python error messages and debugging techniques. The underlying assumption is that individuals with more Python experience (and perhaps more general programming experience) have a better understanding of error identification, resolution, and debugging strategies.

The metrics captured in this high-level analysis are:
- **Completion Rate**: The percentage of participants who started the survey and went on to complete it.
- **Average duration**: The average time (in seconds) taken by participants to complete the survey.
- **Median duration**: The median time (in seconds) taken by participants to complete the survey.
- **Average Python YoE**: The average number of years of experience participants have with Python.
- **Average Programming YoE**: The average number of years of experience participants have with programming in general.
- **Average Self-reported score**: The average number of questions (out of 16) that participants estimate they answered correctly.
- **Correlation between Python YoE and Programming YoE**: The Pearson correlation coefficient between the number of years of experience with Python and the number of years of experience with programming in general.
- **Correlation between self-reported score and Python YoE**: The Spearman correlation coefficient between the self-reported correct answers score and the number of years of experience with Python.
- **Correlation between self-reported score and Programming YoE**: The Spearman correlation coefficient between the self-reported correct answers score and the number of years of experience with programming in general.

> Notes:
> - Apart from the response counts and the completion rate, the metrics are calculated from completed responses only, as incomplete responses do not provide meaningful data for analysis.
> - Since Python YoE and Programming YoE are essentially continuous variables, the correlation between them is calculated using the Pearson method.
> - The self-reported number of correctly answered questions is a discrete variable (0 to 16), so its correlations with the two experience variables are calculated using the Spearman rank method.
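
To make the difference between the two methods concrete, here is a minimal sketch on made-up data (the column names `python_yoe` and `score` and all values are hypothetical, not taken from the survey). It shows that Spearman only looks at the ordering of the values, which is equivalent to applying Pearson to their ranks.

```python
import pandas as pd

# Hypothetical toy data: the score rises with experience, but not linearly.
toy = pd.DataFrame({
    "python_yoe": [1, 2, 3, 5, 8, 13],
    "score": [4, 9, 12, 14, 15, 16],
})

# Pearson measures how well a straight line describes the relationship.
pearson = toy.corr(method="pearson").loc["python_yoe", "score"]

# Spearman measures how well the ordering of one variable matches the other.
spearman = toy.corr(method="spearman").loc["python_yoe", "score"]

# Spearman is the Pearson correlation of the ranked values.
rank_pearson = toy.rank().corr(method="pearson").loc["python_yoe", "score"]

print(f"Pearson:          {pearson:.3f}")
print(f"Spearman:         {spearman:.3f}")
print(f"Pearson on ranks: {rank_pearson:.3f}")
```

Because the toy score increases strictly with experience, the Spearman coefficient is 1 even though the relationship is not a straight line, while the Pearson coefficient stays below 1.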

## Step 0: Install Required Libraries

Before running the analysis, make sure the required libraries are installed. They are all listed in the `requirements.txt` file, so you can install them by running the following command in your terminal:

```bash
pip install -r requirements.txt
```

In [2]:
%pip install -r requirements.txt


## Step 1: Load the Data

Before performing any analysis, we load the survey responses from a CSV file. Keep the default file name unless your survey results are stored under a different name.

In [3]:
import pandas as pd

# Path to the Qualtrics CSV export, relative to this notebook (adjust only if your file lives elsewhere)
file_path = "../data/survey_results_02_06_2025.csv"

# 1. Load data (this export uses semicolons as the field delimiter)
df = pd.read_csv(file_path, sep=';')
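
The later steps rely on a handful of columns from the export (`Finished`, `Duration (in seconds)`, `Q2.2`, `Q2.3`, and `Q11.1`). As a minimal sketch, a check like the following could flag a mismatched export early. The helper `missing_columns` and the inline sample data are hypothetical and only serve as an illustration.

```python
import io

import pandas as pd

# Columns used by the filtering and metric steps of this notebook.
REQUIRED_COLUMNS = ["Finished", "Duration (in seconds)", "Q2.2", "Q2.3", "Q11.1"]


def missing_columns(responses: pd.DataFrame) -> list[str]:
    """Return the required columns that are absent from the DataFrame."""
    return [name for name in REQUIRED_COLUMNS if name not in responses.columns]


# Hypothetical semicolon-delimited sample that deliberately lacks Q11.1.
sample_csv = io.StringIO(
    "Finished;Duration (in seconds);Q2.2;Q2.3\n"
    "1;950;3;10\n"
    "0;120;1;2\n"
)
sample = pd.read_csv(sample_csv, sep=";")

print(missing_columns(sample))
```

Calling `missing_columns(df)` right after loading the real file would return an empty list when all expected columns are present.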

## Step 2: Keep Only Completed Surveys

The dataset may contain entries from respondents who started the survey but did not complete it within the two-week window allowed on Qualtrics. The following step filters out these incomplete entries so that only fully completed responses are analyzed.

In [4]:
# 2. Keep only respondents who finished the survey
completed = df[df["Finished"] == 1]
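
The filter above assumes that `Finished` is stored as the number `1`. If an export stores the flag as text instead (for example `True`/`False`, which is an assumption here and not something this dataset shows), a more tolerant variant could look like the sketch below. The helper `select_completed` and the sample rows are hypothetical.

```python
import pandas as pd


def select_completed(responses: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the rows whose 'Finished' flag is set."""
    # Normalize the flag to lowercase text so 1, 1.0, "1" and "True" all match.
    flag = responses["Finished"].astype(str).str.strip().str.lower()
    finished = flag.isin({"1", "1.0", "true"})
    # Copy so that later modifications do not touch the original DataFrame.
    return responses.loc[finished].copy()


# Hypothetical sample mixing numeric, text, and missing flags.
mixed = pd.DataFrame({
    "respondent": ["a", "b", "c", "d", "e"],
    "Finished": [1, 0, "True", "FALSE", None],
})

print(select_completed(mixed)["respondent"].tolist())
```

For an export where `Finished` is already numeric, this returns the same rows as the direct comparison with `1`.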

## Step 3: Analyze the Data

This step calculates the metrics defined in the introduction to this notebook.

In [5]:
# 3. Calculate the required metrics
metrics = {
    "Total number of responses": len(df),
    "Total number of completed responses": len(completed),
    "Completion rate (%)": round(df["Finished"].mean() * 100, 2),
    "Average duration (sec)": completed["Duration (in seconds)"].mean(),
    "Median duration (sec)": completed["Duration (in seconds)"].median(),
    "Average Python experience (years)": completed["Q2.2"].mean(),
    "Average general experience (years)": completed["Q2.3"].mean(),
    "Average estimated correct answers": completed["Q11.1"].mean(),
    # Correlations: Pearson for the two experience variables, Spearman for the discrete score
    "Pearson Correlation (Python YoE and General programming YoE)": completed[["Q2.2", "Q2.3"]].corr().iloc[0, 1],
    "Spearman Correlation (Python YoE and self-reported score)":
        completed[["Q2.2", "Q11.1"]].corr(method='spearman').iloc[0, 1],
    "Spearman Correlation (General programming YoE and self-reported score)":
        completed[["Q2.3", "Q11.1"]].corr(method='spearman').iloc[0, 1]
}
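
The calculations above assume that `Q2.2`, `Q2.3`, and `Q11.1` were read as numeric columns. If any of them contained free-text answers, `.mean()` and `.corr()` would not work as intended. A minimal sketch of coercing such columns first, using hypothetical sample answers:

```python
import pandas as pd

# Hypothetical answers as they might appear if read in as text.
answers = pd.DataFrame({
    "Q2.2": ["3", "5", "ten", None],
    "Q11.1": ["12", "16", "9", "14"],
})

# Convert every column to numbers; values that cannot be parsed become NaN.
numeric = answers.apply(pd.to_numeric, errors="coerce")

# NaN values are skipped by mean(), so only the parsable answers count.
print(numeric["Q2.2"].mean())
print(numeric["Q2.2"].isna().sum())
```

Counting the resulting `NaN` values shows how many answers were dropped, which is worth checking before trusting the averages.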

## Step 4: Generate the Report

This step creates a summary report of the analysis results. More specifically, it presents the metrics calculated in the previous step in a tidy table format, which can be easily interpreted and shared.

In [6]:
from IPython.display import display, HTML

# 4. Present results in a tidy table
summary_df = pd.DataFrame(list(metrics.items()), columns=["Metric", "Value"])

display(HTML("<h2>Qualtrics Python Skill Level Assessment - High-level Metrics</h2>"))
display(summary_df.style.format(precision=2))

| Metric | Value |
|---|---|
| Total number of responses | 78.0 |
| Total number of completed responses | 60.0 |
| Completion rate (%) | 76.92 |
| Average duration (sec) | 26528.42 |
| Median duration (sec) | 986.5 |
| Average Python experience (years) | 5.7 |
| Average general experience (years) | 12.77 |
| Average estimated correct answers | 12.22 |
| Pearson Correlation (Python YoE and General programming YoE) | 0.55 |
| Spearman Correlation (Python YoE and self-reported score) | 0.24 |
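
The average duration (26528.42 s) is far larger than the median (986.5 s). One plausible explanation, stated here as an assumption rather than something the data confirms, is that a few respondents left the survey open for a long time before finishing. The sketch below uses made-up durations to show how a single such value pulls the mean upward while the median barely moves. The one-hour cap is a hypothetical choice for illustration only.

```python
import pandas as pd

# Hypothetical durations in seconds; the last respondent left the survey open.
durations = pd.Series([600, 900, 1000, 1100, 1500, 500000])

mean_duration = durations.mean()
median_duration = durations.median()

# Capping extreme values at one hour (hypothetical threshold) tames the mean.
capped_mean = durations.clip(upper=3600).mean()

print(f"Mean:        {mean_duration:.2f}")
print(f"Median:      {median_duration:.2f}")
print(f"Capped mean: {capped_mean:.2f}")
```

For skewed timing data like this, the median is usually the more representative summary of a typical respondent.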
