# Task B: Stack Overflow Developer Survey 2025 Analytics

You are provided with the latest developer survey results from Stack Overflow. Your task is to perform analytics on the survey to extract insights about the programming industry.

## Setup
If you are in google colab, you should just be able to run the cell below. Otherwise find the conda `environment.yml` file provided with all the dependencies.

In [3]:
from numpy.distutils.system_info import language_map
%pip install pandas
import pandas as pd

ModuleNotFoundError: No module named 'distutils.msvccompiler'

## Reading the data

Find a utility class below to read the data for you.

In [4]:
import csv
from typing import List, Dict, Any, Optional
from pathlib import Path

RESPONSE_ID_FIELD_NAME = "ResponseId"
QUESTION_ID_FIELD_NAME = "qid"

class SurveyDataReader:
    """
    Read and process Stack Overflow Developer Survey data.
    """

    def __init__(self, schema_file: str, data_file: str):
        self.schema = self._parse_schema(schema_file)
        self.data = self._parse_data(data_file)

    def _parse_schema(self, schema_file: str) -> List[Dict[str, str]]:
        schema = []
        schema_path = Path(schema_file).resolve()
        with open(schema_path, mode="r") as file:
            reader = csv.DictReader(file)
            schema = [row for row in reader]
        return schema

    def _parse_data(self, data_file: str) -> List[Dict[str, Any]]:
        data = []
        data_path = Path(data_file).resolve()
        with open(data_path, mode="r") as file:
            reader = csv.DictReader(file)
            data = [row for row in reader]
        return data

    def get_schema(self) -> List[Dict[str, str]]:
        return self.schema

    def get_data(self) -> List[Dict[str, Any]]:
        return self.data

    def get_question_by_id(self, qid: str) -> Optional[Dict[str, str]]:
        for question in self.schema:
            if question[QUESTION_ID_FIELD_NAME] == qid:
                return question
        return None

    def get_responses_for_question(self, qname: str) -> List[Any]:
        return [response[qname] for response in self.data if qname in response]

    def get_response_by_id(self, response_id: str | int) -> Optional[Dict[str, Any]]:
        response_id_str = str(response_id)
        for response in self.data:
            if response[RESPONSE_ID_FIELD_NAME] == response_id_str:
                return response
        return None

## Getting to know the data reader

In [5]:
SURVEY_SUBDIR = "stack-overflow-developer-survey-2025"
SCHEMA_RELATIVE_PATH = f"{SURVEY_SUBDIR}/survey_results_schema.csv"
DATA_RELATIVE_PATH = f"{SURVEY_SUBDIR}/survey_results_public_cleaned.csv"

reader = SurveyDataReader(SCHEMA_RELATIVE_PATH, DATA_RELATIVE_PATH)

In [6]:
#print(reader.get_schema())

#print(len(reader.get_data()))

print(reader.get_data()[0:10]) # Be careful when trying to output the data, there's lots of it!

[{'ResponseId': '1', 'MainBranch': 'I am a developer by profession', 'Age': '25-34 years old', 'EdLevel': 'Master’s degree (M.A., M.S., M.Eng., MBA, etc.)', 'Employment': 'Employed', 'EmploymentAddl': 'Caring for dependents (children, elderly, etc.)', 'WorkExp': '8', 'LearnCodeChoose': 'Yes, I am not new to coding but am learning new coding techniques or programming language', 'LearnCode': 'Online Courses or Certification (includes all media types);Other online resources (e.g. standard search, forum, online community)', 'LearnCodeAI': 'Yes, I learned how to use AI-enabled tools for my personal curiosity and/or hobbies', 'AILearnHow': 'AI CodeGen tools or AI-enabled apps', 'YearsCode': '14', 'DevType': 'Developer, mobile', 'OrgSize': '20 to 99 employees', 'ICorPM': 'People manager', 'RemoteWork': 'Remote', 'PurchaseInfluence': 'Yes, I influenced the purchase of a substantial addition to the tech stack', 'TechEndorseIntro': 'Work', 'TechEndorse_1': '10', 'TechEndorse_2': '7', 'TechEndors

## Questions

1. Print all of the questions asked in the developer survey

In [7]:
QID = '\ufeff"qid"'

q_set = set()
count = 1

for q in reader.get_schema():
    if q[QID] not in q_set:
        q_set.add(q[QID])

        print(f"Q{count}: ", end="")
        print(q["question"])
        print()

        count += 1


Q1: What attracts you to a technology or causes you to endorse it (most to least important)?

Q2: What would turn you off or cause you to reject it (most to least important)?

Q3: Rank the following attributes of your current professional job in technology according to those that contribute your job satisfaction so that the first is the most important, last is least important (if you just started a new job, consider the job you spent the most time at in the past year):

Q4: When visiting Stack Overflow, which following activities are you most interested in?  Please rank the following so the first activity is what most interests you and last is your least interested activity.

Q5: Are you someone who writes code? Please select one of the following options that best describes you today.

Q6: What is your age?

Q7: Which of the following best describes the highest level of formal education that you’ve completed? 

Q8: Which of the following best describes your current employment status?<b

2. Which age range has the most responses in the survey?

In [8]:
age_map = dict()

for r in reader.get_data():
    if r['Age'] not in age_map:
        age_map[r['Age']] = 0

    age_map[r['Age']] += 1

print(max(age_map, key=age_map.get))

#print(age_map)

25-34 years old


3. How many survey respondents do we know definitely work for a company larger than Marshall Wace? (Feel free to ask one of us if you don't know how large Marshall Wace is!)

In [9]:
MW_head_count = 800

larger_than_MW = 0

for r in reader.get_data():
    size = r["OrgSize"]

    if size.startswith("NA") or size.startswith("Less") or size.startswith("I") or size.startswith("Just"):
        continue

    lower_bound = int(size.split(" ")[0].replace(",", ""))

    if lower_bound > MW_head_count:
        larger_than_MW += 1


print(larger_than_MW)

10715


4. How many survey respondents had less than 1 year of coding experience before (or outside of) coding for their profession?

In [10]:
no_experience_count = 0

for r in reader.get_data():
    if r["YearsCode"] == r["WorkExp"] and r["YearsCode"] != "NA":
        no_experience_count += 1

print(no_experience_count)

5267


5. Of the people who had 1 or more years of coding experience outside of coding professionally, what is the average number of years they spent coding outside of work? For simplicity, you can consider only the people who have given an exact number of years they have spent coding in both columns (i.e. excluding those with over 50 or less than 1 year)

In [11]:
total = 0
count = 0

for r in reader.get_data():
    if r["YearsCode"] != r["WorkExp"] and r["YearsCode"] != "NA" and r["WorkExp"] != "NA":
        outside_years = int(r["YearsCode"]) - int(r["WorkExp"])

        if outside_years > 50:
            continue

        total += outside_years
        count += 1

print(total / count)

4.3369423616941205


6. What is the median annual total compensation of those who specified their compensation in USD

In [33]:
import bisect

# Data contains extreme values for compensation, so I will exclude extreme 10% (5% on either side)

values = []

count = 0

for r in reader.get_data():
    if r["Currency"].startswith("USD") and r["CompTotal"] != "NA":

        count += 1
        comp = float(r["CompTotal"])

        if comp < 1_000.0 or comp > 1_000_000.0:
            continue

        bisect.insort(values, comp)

drop_count = len(values) // 20

values = values[drop_count:]

values = values[:-drop_count]

mid = len(values) // 2

if len(values) % 2 == 0:
    print((values[mid - 1] + values[mid]) / 2)
else:
    print(values[mid])

136000.0


7. Which programming language has respondents with the highest annual compensation in USD? If a response lists multiple languages, you can attribute the compensation to each language in the response.

In [32]:
import bisect

language_map = dict()

# Data contains extreme values for compensation, so I will exclude extreme 10% (5% on either side)
# I also used median compensation

for r in reader.get_data():
    if r["Currency"].startswith("USD") and r["CompTotal"] != "NA":
        comp = float(r["CompTotal"])

        for l in filter(lambda x: x != "NA", r["LanguageHaveWorkedWith"].split(";")):
            if l not in language_map:
                language_map[l] = []

            bisect.insort(language_map[l], comp)

result_dict = dict()

for key, data in language_map.items():
    mid = len(data) // 2
    median = 0

    drop_count = len(data) // 20

    data = data[drop_count:]

    data = data[:-drop_count]

    if len(data) % 2 == 0:
        median = (data[mid - 1] + data[mid]) / 2
    else:
        median = data[mid]

    result_dict[key] = median

print([(key, value) for key, value in result_dict.items()])
print()
print(max(result_dict, key=result_dict.get))

[('C', 134500.0), ('C#', 139600.0), ('C++', 140000.0), ('Delphi', 101000.0), ('HTML/CSS', 140000.0), ('Java', 150000.0), ('JavaScript', 140000.0), ('Lua', 150000.0), ('PowerShell', 136000.0), ('Python', 150000.0), ('SQL', 145000.0), ('TypeScript', 150000.0), ('VBA', 120000.0), ('Visual Basic (.Net)', 117600.0), ('Scala', 200000.0), ('Bash/Shell (all shells)', 150000.0), ('Kotlin', 150000.0), ('Rust', 155500.0), ('Swift', 154500.0), ('Elixir', 172500.0), ('Ada', 100000.0), ('Assembly', 120000.0), ('COBOL', 100000.0), ('Fortran', 124000.0), ('Ruby', 167954.0), ('PHP', 110000.0), ('MicroPython', 106000.0), ('F#', 130000.0), ('Go', 170000.0), ('Perl', 150000.0), ('MATLAB', 115000.0), ('Dart', 120000.0), ('Groovy', 162000.0), ('Prolog', 140000.0), ('R', 125000.0), ('OCaml', 120000.0), ('GDScript', 129500.0), ('Gleam', 124000.0), ('Mojo', 75000.0), ('Zig', 130000.0), ('Erlang', 165000.0), ('Lisp', 149000.0)]

Scala


## Bonus Task: SurveyDataReader

`SurveyDataReader` is a basic class that allows you to access the underlying survey data programmatically. The class is implemented with basic data structures and no external dependencies hence there is plenty of room for optimisation. Try to improve the speed of basic operations and add some of your own by potentially leveraging a package such as [NumPy](https://numpy.org/).