Our dataset comes from the Stack Overflow Developer Survey, an annual survey conducted by Stack Overflow, one of the largest online platforms for developers. Each year, the survey collects responses from tens of thousands of developers worldwide, providing insights into programming languages, tools, work environments, and emerging technologies.

For example, the 2024 survey gathered responses from over 65,000 developers across 185 countries between May 19 and June 20, 2024. The survey data is publicly available under the Open Database License (ODbL).

We obtained our dataset from the official survey at survey.stackoverflow.co.

The Stack Overflow Developer Survey 2024 is highly relevant to our research because it covers the skills, career motivations, frustrations, and job satisfaction of software developers worldwide. Our research focuses on improving hiring and retention strategies in the tech industry by identifying real-world indicators of developer success beyond traditional credentials.

The survey data supports our central argument: technical ability alone does not define a strong developer; continuous learning, problem-solving, and workplace satisfaction are equally critical factors. The findings show that 82% of developers rely on self-directed learning, underscoring the importance of adaptability over static qualifications. The survey also highlights key retention challenges, such as technical debt (the number-one frustration) and burnout from excessive workloads, reinforcing the idea that hiring the right developers must go beyond simple coding assessments.

With this dataset, we can analyze trends that affect developer hiring and retention and derive data-driven strategies to help companies predict job success, reduce turnover, and build more resilient teams.

In [10]:
import requests
import zipfile
import io
import pandas as pd

def download_and_extract_csv(zip_url):
    """
    Downloads a ZIP file from `zip_url`, extracts the first .csv file found,
    and returns it as a Pandas DataFrame.
    """
    # A timeout keeps the notebook from hanging indefinitely on a stalled connection
    response = requests.get(zip_url, timeout=60)
    if response.status_code == 200:
        zip_content = io.BytesIO(response.content)
        
        with zipfile.ZipFile(zip_content, 'r') as zip_ref:
            for file_name in zip_ref.namelist():
                if file_name.endswith('.csv'):
                    with zip_ref.open(file_name) as csv_file:
                        df = pd.read_csv(csv_file)
                        return df
        # If no CSV is found in the ZIP, return None
        return None
    else:
        print(f"Failed to retrieve ZIP from {zip_url}")
        return None

# URLs of the ZIP files
csv_url_2024 = "https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip"
csv_url_2023 = "https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip"
csv_url_2022 = "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2022.zip"
# Download and load each dataset
odf_2024 = download_and_extract_csv(csv_url_2024)
odf_2023 = download_and_extract_csv(csv_url_2023)
odf_2022 = download_and_extract_csv(csv_url_2022)

# Check if all DataFrames were loaded successfully
if odf_2024 is not None and odf_2023 is not None and odf_2022 is not None:
    print("All DataFrames loaded successfully.")
else:
    print("One or more DataFrames could not be loaded. Please check the URLs or network connection.")

All DataFrames loaded successfully.
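
The helper above combines the network request with the ZIP parsing, and it returns whichever `.csv` member comes first in the archive. The following is a minimal sketch of the parsing step on its own, so it can be tried without network access. The function name `read_first_csv_from_zip_bytes`, the optional `name_contains` filter, and the in-memory archive with its file names are all assumptions made for illustration; none of them are part of the original notebook.

```python
import io
import zipfile

import pandas as pd


def read_first_csv_from_zip_bytes(zip_bytes, name_contains=None):
    """Return the first matching .csv member of a ZIP archive as a DataFrame, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".csv"):
                continue
            # Hypothetical filter: skip CSV members whose name lacks the given substring
            if name_contains is not None and name_contains not in name:
                continue
            with zf.open(name) as fh:
                return pd.read_csv(fh)
    return None


# Build a tiny in-memory archive (made-up contents) to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("README.txt", "not a csv")
    zf.writestr("schema.csv", "qname,question\nAge,What is your age?\n")
    zf.writestr("results.csv", "ResponseId,Country\n1,Norway\n2,Chile\n")

first_csv = read_first_csv_from_zip_bytes(buffer.getvalue())
results = read_first_csv_from_zip_bytes(buffer.getvalue(), name_contains="results")
print(first_csv.shape, results.shape)
```

If an archive holds more than one CSV, the unfiltered call depends on the order of members in the archive, so a name filter of this kind may make the choice explicit.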


In [11]:
odf_2024.columns = odf_2024.columns.str.upper().str.replace(" ", "")
odf_2023.columns = odf_2023.columns.str.upper().str.replace(" ", "")
odf_2022.columns = odf_2022.columns.str.upper().str.replace(" ", "")

with pd.option_context('display.max_rows', None):
    print(odf_2024.dtypes)

RESPONSEID                          int64
MAINBRANCH                         object
AGE                                object
EMPLOYMENT                         object
REMOTEWORK                         object
CHECK                              object
CODINGACTIVITIES                   object
EDLEVEL                            object
LEARNCODE                          object
LEARNCODEONLINE                    object
TECHDOC                            object
YEARSCODE                          object
YEARSCODEPRO                       object
DEVTYPE                            object
ORGSIZE                            object
PURCHASEINFLUENCE                  object
BUYNEWTOOL                         object
BUILDVSBUY                         object
TECHENDORSE                        object
COUNTRY                            object
CURRENCY                           object
COMPTOTAL                         float64
LANGUAGEHAVEWORKEDWITH             object
LANGUAGEWANTTOWORKWITH            

In [12]:
with pd.option_context('display.max_columns', None,
                       'display.max_rows', None,
                       'display.width', 6000):
    print(odf_2024.head(20))



    RESPONSEID                                         MAINBRANCH                 AGE                                         EMPLOYMENT                            REMOTEWORK   CHECK                                   CODINGACTIVITIES                                            EDLEVEL                                          LEARNCODE                                    LEARNCODEONLINE                                            TECHDOC YEARSCODE YEARSCODEPRO                  DEVTYPE ORGSIZE PURCHASEINFLUENCE BUYNEWTOOL BUILDVSBUY TECHENDORSE                                            COUNTRY CURRENCY  COMPTOTAL                             LANGUAGEHAVEWORKEDWITH                             LANGUAGEWANTTOWORKWITH                                    LANGUAGEADMIRED                             DATABASEHAVEWORKEDWITH                   DATABASEWANTTOWORKWITH                          DATABASEADMIRED                             PLATFORMHAVEWORKEDWITH                             PLATFORMWANTTOWORK

In [13]:
# DROPPING
# CHECK is a simple attention-check question that verifies respondents are paying attention to the survey

# Filter rows where the "Check" column equals "Apples"
odf_2024 = odf_2024[odf_2024['CHECK'] == "Apples"]

# Drop the columns (note that the column names are now uppercase with spaces removed)
cols_to_drop = [
    'CHECK',
    'TECHDOC',
    'PURCHASEINFLUENCE',
    'BUYNEWTOOL',
    'BUILDVSBUY',
    'TECHENDORSE',
    'LANGUAGEADMIRED',
    'LANGUAGEWANTTOWORKWITH',
    'DATABASEADMIRED',
    'DATABASEWANTTOWORKWITH',
    'PLATFORMADMIRED',
    'PLATFORMWANTTOWORKWITH',
    'WEBFRAMEADMIRED',
    'WEBFRAMEWANTTOWORKWITH',
    'EMBEDDEDADMIRED',
    'EMBEDDEDWANTTOWORKWITH',
    'TOOLSTECHADMIRED',
    'TOOLSTECHWANTTOWORKWITH',
    'MISCTECHADMIRED',
    'MISCTECHWANTTOWORKWITH',
    'NEWCOLLABTOOLSADMIRED',
    'NEWCOLLABTOOLSWANTTOWORKWITH',
    'OFFICESTACKASYNCADMIRED',
    'OFFICESTACKASYNCWANTTOWORKWITH',
    'OFFICESTACKSYNCHAVEWORKEDWITH',
    'AISEARCHDEVADMIRED',
    'AISEARCHDEVWANTTOWORKWITH',
    'SOCOMM',
    'KNOWLEDGE_5',
    'KNOWLEDGE_6',
    'KNOWLEDGE_8',
    'KNOWLEDGE_9',
    'SURVEYLENGTH',
    'SURVEYEASE',
    'CURRENCY',
    'COMPTOTAL',
    'RESPONSEID'
]

odf_2024 = odf_2024.drop(columns=cols_to_drop)

print(odf_2024.shape)


(65437, 77)
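
The cell above prints only the final shape, so it does not show how many rows the attention-check filter removed, and `drop` raises a `KeyError` if any listed column is absent. A minimal sketch on a made-up toy DataFrame shows one way to report both before dropping. The names `toy`, `wanted_drops`, and `NOT_A_COLUMN` are illustrative assumptions, not survey columns or notebook variables.

```python
import pandas as pd

# Toy data standing in for the survey (values are made up)
toy = pd.DataFrame({
    "CHECK": ["Apples", "Apples", "Oranges", None],
    "AGE": ["18-24 years old", "25-34 years old", "35-44 years old", "45-54 years old"],
    "CURRENCY": ["USD", "EUR", "USD", "GBP"],
})
wanted_drops = ["CHECK", "CURRENCY", "NOT_A_COLUMN"]

# Rows pass only if CHECK equals "Apples"; missing answers compare as False
passed = toy["CHECK"] == "Apples"
rows_removed = int((~passed).sum())

# Split the drop list into columns that exist and columns that do not
present = [c for c in wanted_drops if c in toy.columns]
missing = [c for c in wanted_drops if c not in toy.columns]

cleaned = toy.loc[passed].drop(columns=present)
print(f"rows removed: {rows_removed}, missing columns: {missing}, shape: {cleaned.shape}")
```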


In [14]:
# Identify columns that are common to all three DataFrames
common_cols = set(odf_2024.columns).intersection(odf_2023.columns).intersection(odf_2022.columns)

# Subset each DataFrame to these common columns
df_2024 = odf_2024[list(common_cols)].copy()
df_2023 = odf_2023[list(common_cols)].copy()
df_2022 = odf_2022[list(common_cols)].copy()

# Add a "year" column to each DataFrame
df_2024["year"] = 2024
df_2023["year"] = 2023
df_2022["year"] = 2022

# Concatenate all DataFrames
combined_df = pd.concat([df_2024, df_2023, df_2022], ignore_index=True)

# Print out some info to verify
print("Combined DataFrame shape:", combined_df.shape)
print(combined_df.head())

# Print column data types and other info (optional)
with pd.option_context('display.max_rows', None):
    print(combined_df.dtypes)

Combined DataFrame shape: (227889, 44)
                       MAINBRANCH  \
0  I am a developer by profession   
1  I am a developer by profession   
2  I am a developer by profession   
3           I am learning to code   
4  I am a developer by profession   

                              LANGUAGEHAVEWORKEDWITH  \
0                                                NaN   
1  Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...   
2                                                 C#   
3  C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...   
4            C++;HTML/CSS;JavaScript;Lua;Python;Rust   

                                             EDLEVEL  \
0                          Primary/elementary school   
1       Bachelor’s degree (B.A., B.S., B.Eng., etc.)   
2    Master’s degree (M.A., M.S., M.Eng., MBA, etc.)   
3  Some college/university study without earning ...   
4  Secondary school (e.g. American high school, G...   

                      OFFICESTACKASYNCHAVEWORKEDWITH FREQUENCY_2
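
Because `common_cols` is a Python `set`, the column order of the combined DataFrame is not guaranteed to be the same from one session to the next. A minimal sketch, assuming a hypothetical helper `ordered_common_columns` and three made-up toy frames, keeps the order of the first frame instead:

```python
import pandas as pd


def ordered_common_columns(frames):
    """Columns present in every frame, in the order they appear in the first frame."""
    first, *rest = frames
    return [c for c in first.columns if all(c in f.columns for f in rest)]


# Toy frames with partly overlapping columns (names and values are made up)
a = pd.DataFrame({"AGE": ["18-24"], "COUNTRY": ["Norway"], "ONLY_IN_A": ["x"]})
b = pd.DataFrame({"COUNTRY": ["Chile"], "AGE": ["25-34"], "ONLY_IN_B": ["y"]})
c = pd.DataFrame({"AGE": ["35-44"], "ONLY_IN_C": ["z"], "COUNTRY": ["Japan"]})

cols = ordered_common_columns([a, b, c])

# Subset each frame to the shared columns, tag it with its year, and stack them
combined = pd.concat(
    [f[cols].assign(year=y) for f, y in zip([a, b, c], [2024, 2023, 2022])],
    ignore_index=True,
)
print(combined)
```

A list comprehension over the first frame's columns gives the same membership as the set intersection, with a reproducible order.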

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def encode_multi_select_columns(df):
    """
    This function:
      - Detects object-type columns with semicolon-separated values.
      - Splits these values into lists.
      - One-hot encodes the lists using MultiLabelBinarizer.
      - Drops the original columns and joins the new binary columns.
      - Converts column names to uppercase and drops any columns containing '_OTHER'.
      
    Parameters:
      df (pd.DataFrame): The input DataFrame.
      
    Returns:
      pd.DataFrame: The DataFrame after one-hot encoding multi-select columns.
    """
    print(f'Number of columns before encoding: {df.shape[1]}')
    
    # Determine which columns include multiple labels (semicolon-separated strings)
    multi_select_cols = []
    for col in df.select_dtypes(include='object').columns:
        if df[col].dropna().astype(str).str.contains(';').any():
            multi_select_cols.append(col)
    
    # One-hot encode the detected columns using MultiLabelBinarizer
    for col in multi_select_cols:
        # Get the raw column as strings, filling NaN with ''
        raw_col = df[col].fillna('').astype(str)
        
        # Split and strip each value into a list
        split_col = raw_col.apply(lambda x: [item.strip() for item in x.split(';')] if x else [])
        
        # Fit and transform with MultiLabelBinarizer
        mlb = MultiLabelBinarizer()
        encoded = mlb.fit_transform(split_col)
        
        # Create a DataFrame with meaningful column names
        encoded_df = pd.DataFrame(encoded, 
                                  columns=[f"{col}_{cls}" for cls in mlb.classes_],
                                  index=df.index)
        
        # Drop the original column and join the encoded columns.
        # Reassigning instead of using inplace=True avoids mutating the caller's DataFrame.
        df = df.drop(columns=[col]).join(encoded_df)
    
    print(f'Number of columns after encoding: {df.shape[1]}')
    
    # Convert all column names to uppercase
    df.columns = df.columns.str.upper()
    
    # Drop columns that contain '_OTHER'
    df = df.drop(columns=[col for col in df.columns if '_OTHER' in col])
    
    print(f'Number of columns after dropping _OTHER: {df.shape[1]}')
    
    return df

encoded_2024 = encode_multi_select_columns(odf_2024)


Number of columns before encoding: 77
Number of columns after encoding: 648
Number of columns after dropping _OTHER: 628
                       MAINBRANCH                 AGE REMOTEWORK  \
0  I am a developer by profession  Under 18 years old     Remote   
1  I am a developer by profession     35-44 years old     Remote   
2  I am a developer by profession     45-54 years old     Remote   
3           I am learning to code     18-24 years old        NaN   
4  I am a developer by profession     18-24 years old        NaN   

                                             EDLEVEL YEARSCODE YEARSCODEPRO  \
0                          Primary/elementary school       NaN          NaN   
1       Bachelor’s degree (B.A., B.S., B.Eng., etc.)        20           17   
2    Master’s degree (M.A., M.S., M.Eng., MBA, etc.)        37           27   
3  Some college/university study without earning ...         4          NaN   
4  Secondary school (e.g. American high school, G...         9          NaN
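
To make the encoding step concrete, here is a minimal sketch on a made-up two-column toy DataFrame. It swaps in pandas' own `Series.str.get_dummies` for `MultiLabelBinarizer`, so it is an alternative illustration of the same idea rather than the function used above. Unlike that function, it does not strip whitespace around items.

```python
import pandas as pd

# Toy data: one semicolon-separated multi-select column and one ordinary column
toy = pd.DataFrame({
    "LANGUAGEHAVEWORKEDWITH": ["C;Python", "Python;Rust", None],
    "AGE": ["18-24 years old", "25-34 years old", "35-44 years old"],
})

col = "LANGUAGEHAVEWORKEDWITH"

# One 0/1 column per distinct item; a missing answer becomes a row of zeros
dummies = toy[col].str.get_dummies(sep=";").add_prefix(f"{col}_")

# Replace the original column with its binary indicator columns
encoded = toy.drop(columns=[col]).join(dummies)
print(encoded)
```

Each distinct answer becomes its own indicator column, which is why the 2024 data grows from 77 columns to several hundred after encoding.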

In [None]:
with pd.option_context('display.max_columns', None,
                       'display.max_rows', None,
                       'display.width', 6000):
    print(encoded_2024.head(20))


                                           MAINBRANCH                 AGE                            REMOTEWORK                                            EDLEVEL YEARSCODE YEARSCODEPRO                  DEVTYPE ORGSIZE                                            COUNTRY                          SOVISITFREQ SOACCOUNT                           SOPARTFREQ                 AISELECT          AISENT                       AIACC                                      AICOMPLEX      AITHREAT TBRANCH                  ICORPM  WORKEXP     KNOWLEDGE_1                 KNOWLEDGE_2                 KNOWLEDGE_3 KNOWLEDGE_4 KNOWLEDGE_7       FREQUENCY_1        FREQUENCY_2        FREQUENCY_3         TIMESEARCHING         TIMEANSWERING                   PROFESSIONALCLOUD              PROFESSIONALQUESTION                    INDUSTRY  JOBSATPOINTS_1  JOBSATPOINTS_4  JOBSATPOINTS_5  JOBSATPOINTS_6  JOBSATPOINTS_7  JOBSATPOINTS_8  JOBSATPOINTS_9  JOBSATPOINTS_10  JOBSATPOINTS_11  CONVERTEDCOMPYEARLY  JOBSAT  EMPLO

In [None]:
# Create a new DataFrame dropping rows with NaN in CONVERTEDCOMPYEARLY
df_comp = encoded_2024.dropna(subset=['CONVERTEDCOMPYEARLY'])
print("New DataFrame shape:", df_comp.shape)



New DataFrame shape: (23435, 628)


In [20]:
import pandas as pd

# df_comp contains only rows with a non-NaN compensation value in 'CONVERTEDCOMPYEARLY'
range_dict = {}

# Iterate over each column in df_comp
for col in df_comp.columns:
    # Check if the column is string-based (object type)
    if df_comp[col].dtype == 'object':
        # Drop rows where either the column or CONVERTEDCOMPYEARLY is NaN
        df_temp = df_comp.dropna(subset=[col, 'CONVERTEDCOMPYEARLY'])
        # Group by the column and compute the mean compensation for each group
        grouped_means = df_temp.groupby(col)['CONVERTEDCOMPYEARLY'].mean()
        # Only calculate the range if there is more than one group
        if len(grouped_means) > 1:
            comp_range = grouped_means.max() - grouped_means.min()
            range_dict[col] = comp_range

# Convert the dictionary to a pandas Series and sort by the range in descending order
range_series = pd.Series(range_dict).sort_values(ascending=False)
print("Columns sorted by compensation range (largest to smallest):")
print(range_series[:10])


Columns sorted by compensation range (largest to smallest):
COUNTRY                 1.999999e+06
DEVTYPE                 2.071312e+05
YEARSCODEPRO            1.782993e+05
YEARSCODE               1.297780e+05
AGE                     1.261641e+05
AIACC                   1.036092e+05
AICOMPLEX               8.214306e+04
SOPARTFREQ              7.560119e+04
PROFESSIONALQUESTION    7.447921e+04
EDLEVEL                 7.196168e+04
dtype: float64
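
A range of group means can be dominated by a single very small group with an extreme value, which may be part of what drives the large COUNTRY figure above. The following is a minimal sketch of a more outlier-resistant variant, assuming a hypothetical helper `group_median_range` and an arbitrary `min_count` threshold. Neither comes from the original analysis, and the toy data is made up.

```python
import pandas as pd


def group_median_range(df, col, target, min_count=30):
    """Range of per-group medians of `target`, ignoring groups smaller than `min_count`."""
    stats = (
        df.dropna(subset=[col, target])
        .groupby(col)[target]
        .agg(["median", "count"])
    )
    stats = stats[stats["count"] >= min_count]
    # A range needs at least two groups to be meaningful
    if len(stats) < 2:
        return None
    return float(stats["median"].max() - stats["median"].min())


# Toy data: group C has a single extreme value
toy = pd.DataFrame({
    "G": ["A", "A", "A", "B", "B", "B", "C"],
    "PAY": [10, 20, 30, 100, 200, 300, 1_000_000],
})

robust = group_median_range(toy, "G", "PAY", min_count=2)  # group C is excluded
naive = group_median_range(toy, "G", "PAY", min_count=1)   # group C is included
print(robust, naive)
```

On this toy data the single-row group alone determines the unfiltered range, while the filtered version reflects only the two well-populated groups.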
