Our dataset comes from the Stack Overflow Developer Survey, an annual survey conducted by Stack Overflow, one of the largest online platforms for developers. Each year, the survey collects responses from tens of thousands of developers worldwide, providing insights into programming languages, tools, work environments, and emerging technologies.

For example, the 2024 survey gathered responses from over 65,000 developers across 185 countries between May 19 and June 20, 2024. The survey data is publicly available under the Open Database License (ODbL).

We obtained our dataset from the official survey site, survey.stackoverflow.co, and load the 2022, 2023, and 2024 results below.

The Stack Overflow Developer Survey 2024 is highly relevant to our research because it covers the skills, career motivations, frustrations, and job satisfaction of software developers worldwide. Our research focuses on improving hiring and retention strategies in the tech industry by identifying real-world indicators of developer success beyond traditional credentials. The survey data supports our central argument: technical ability alone does not define a strong developer, and continuous learning, problem-solving, and workplace satisfaction are equally critical factors. The findings show that 82% of developers rely on self-directed learning, underscoring the importance of adaptability over static qualifications.

The survey also highlights key retention challenges, such as technical debt (the number-one frustration) and burnout from excessive workloads, which reinforces the idea that hiring the right developers must go beyond simple coding assessments. With this dataset, we can analyze trends that affect developer hiring and retention and derive data-driven strategies to help companies predict job success, reduce turnover, and build more resilient teams.

In [13]:
import requests
import zipfile
import io
import pandas as pd

def download_and_extract_csv(zip_url):
    """
    Downloads a ZIP file from `zip_url`, extracts the first .csv file found,
    and returns it as a Pandas DataFrame. Returns None if the download fails
    or the archive contains no CSV.
    """
    # A timeout keeps the notebook from hanging on an unresponsive server
    response = requests.get(zip_url, timeout=60)
    if response.status_code != 200:
        print(f"Failed to retrieve ZIP from {zip_url} (HTTP {response.status_code})")
        return None

    # Read the archive from memory instead of writing it to disk
    with zipfile.ZipFile(io.BytesIO(response.content), 'r') as zip_ref:
        for file_name in zip_ref.namelist():
            if file_name.endswith('.csv'):
                with zip_ref.open(file_name) as csv_file:
                    return pd.read_csv(csv_file)
    # If no CSV is found in the ZIP, return None
    return None

# URLs of the ZIP files
csv_url_2024 = "https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip"
csv_url_2023 = "https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip"
csv_url_2022 = "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2022.zip"
# Download and load each dataset
odf_2024 = download_and_extract_csv(csv_url_2024)
odf_2023 = download_and_extract_csv(csv_url_2023)
odf_2022 = download_and_extract_csv(csv_url_2022)

# Check if all DataFrames were loaded successfully
if odf_2024 is not None and odf_2023 is not None and odf_2022 is not None:
    print("All DataFrames loaded successfully.")
else:
    print("One or more DataFrames could not be loaded. Please check the URLs or network connection.")

All DataFrames loaded successfully.
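
The ZIP-handling step of `download_and_extract_csv` can be checked without network access. The following is a minimal sketch under stated assumptions: `first_csv_from_zip_bytes` and `make_demo_zip` are hypothetical helpers that are not part of the notebook, and the archive contents are made-up demo data rather than survey data.

```python
import io
import zipfile

import pandas as pd


def first_csv_from_zip_bytes(zip_bytes):
    """Return the first .csv member of a ZIP archive as a DataFrame, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_ref:
        for file_name in zip_ref.namelist():
            if file_name.endswith(".csv"):
                with zip_ref.open(file_name) as csv_file:
                    return pd.read_csv(csv_file)
    return None


def make_demo_zip(include_csv=True):
    """Build a tiny in-memory ZIP (made-up data) with a README and, optionally, one CSV."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zip_ref:
        zip_ref.writestr("README.txt", "demo archive")
        if include_csv:
            zip_ref.writestr("survey.csv", "ResponseId,Country\n1,Canada\n2,Brazil\n")
    return buffer.getvalue()


demo_df = first_csv_from_zip_bytes(make_demo_zip())
print(demo_df.shape)
```

Because the function returns the first CSV it encounters, an archive that ships several CSV files is read in `namelist()` order, so it may be worth confirming that the loaded columns are the expected ones.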


In [14]:
odf_2024.columns = odf_2024.columns.str.upper().str.replace(" ", "")
odf_2023.columns = odf_2023.columns.str.upper().str.replace(" ", "")
odf_2022.columns = odf_2022.columns.str.upper().str.replace(" ", "")

with pd.option_context('display.max_rows', None):
    print(odf_2024.dtypes)

RESPONSEID                          int64
MAINBRANCH                         object
AGE                                object
EMPLOYMENT                         object
REMOTEWORK                         object
CHECK                              object
CODINGACTIVITIES                   object
EDLEVEL                            object
LEARNCODE                          object
LEARNCODEONLINE                    object
TECHDOC                            object
YEARSCODE                          object
YEARSCODEPRO                       object
DEVTYPE                            object
ORGSIZE                            object
PURCHASEINFLUENCE                  object
BUYNEWTOOL                         object
BUILDVSBUY                         object
TECHENDORSE                        object
COUNTRY                            object
CURRENCY                           object
COMPTOTAL                         float64
LANGUAGEHAVEWORKEDWITH             object
LANGUAGEWANTTOWORKWITH            
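
The column normalization in the cell above makes column names comparable across survey years. A small sketch of the same idea on made-up frames, assuming a hypothetical helper `normalize_columns` that works on a copy instead of modifying the frame in place:

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy of `df` with uppercased column names and spaces removed."""
    out = df.copy()
    out.columns = out.columns.str.upper().str.replace(" ", "", regex=False)
    return out


# Two made-up frames whose column names differ only in case and spacing
df_a = pd.DataFrame({"ResponseId": [1], "Remote Work": ["Hybrid"]})
df_b = pd.DataFrame({"responseid": [7], "RemoteWork": ["Remote"]})

norm_a = normalize_columns(df_a)
norm_b = normalize_columns(df_b)
print(list(norm_a.columns))
print(list(norm_b.columns))
```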

In [15]:
with pd.option_context('display.max_columns', None,
                       'display.max_rows', None,
                       'display.width', 6000):
    print(odf_2024.head(20))



    RESPONSEID                                         MAINBRANCH                 AGE                                         EMPLOYMENT                            REMOTEWORK   CHECK                                   CODINGACTIVITIES                                            EDLEVEL                                          LEARNCODE                                    LEARNCODEONLINE                                            TECHDOC YEARSCODE YEARSCODEPRO                  DEVTYPE ORGSIZE PURCHASEINFLUENCE BUYNEWTOOL BUILDVSBUY TECHENDORSE                                            COUNTRY CURRENCY  COMPTOTAL                             LANGUAGEHAVEWORKEDWITH                             LANGUAGEWANTTOWORKWITH                                    LANGUAGEADMIRED                             DATABASEHAVEWORKEDWITH                   DATABASEWANTTOWORKWITH                          DATABASEADMIRED                             PLATFORMHAVEWORKEDWITH                             PLATFORMWANTTOWORK

In [16]:
# DROPPING
# The CHECK question is a simple attention check to make sure respondents are paying attention to the survey

# Keep only rows where the "CHECK" column equals "Apples"
odf_2024 = odf_2024[odf_2024['CHECK'] == "Apples"]

# Drop the columns (note the column names are now uppercase)
cols_to_drop = [
    'CHECK', 'TECHDOC', 'PURCHASEINFLUENCE', 'BUYNEWTOOL', 'BUILDVSBUY',
    'TECHENDORSE', 'LANGUAGEADMIRED', 'DATABASEADMIRED', 'PLATFORMADMIRED',
    'WEBFRAMEADMIRED', 'EMBEDDEDADMIRED', 'TOOLSTECHADMIRED', 'MISCTECHADMIRED',
    'NEWCOLLABTOOLSADMIRED', 'OFFICESTACKASYNCADMIRED', 'OFFICESTACKSYNCHAVEWORKEDWITH',
    'OFFICESTACKSYNCWANTTOWORKWITH', 'AISEARCHDEVADMIRED', 'SOCOMM', 'KNOWLEDGE_5',
    'KNOWLEDGE_6', 'KNOWLEDGE_8', 'KNOWLEDGE_9', 'SURVEYLENGTH', 'SURVEYEASE',
    'CURRENCY', 'COMPTOTAL'
]
odf_2024 = odf_2024.drop(columns=cols_to_drop)

print(odf_2024.shape)


(65437, 87)
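
The filter-then-drop step can be sketched on a toy frame. This is an illustration under assumptions, not the notebook's method: `filter_and_drop` is a hypothetical helper, the data is made up, and the helper reports requested columns that are absent instead of raising a `KeyError`. Note that rows with a missing `CHECK` value are removed as well, because comparing `NaN` to `"Apples"` evaluates to `False`.

```python
import pandas as pd


def filter_and_drop(df, cols_to_drop, check_col="CHECK", expected="Apples"):
    """Keep rows that pass the attention check, then drop the listed columns.

    Returns the reduced frame and the list of requested columns that were absent.
    """
    kept = df[df[check_col] == expected]
    present = [c for c in cols_to_drop if c in kept.columns]
    missing = [c for c in cols_to_drop if c not in kept.columns]
    return kept.drop(columns=present), missing


# Made-up responses: one passes the check, one fails, one left it blank
toy = pd.DataFrame({
    "RESPONSEID": [1, 2, 3],
    "CHECK": ["Apples", "Oranges", None],
    "SURVEYEASE": ["Easy", "Easy", "Difficult"],
})

reduced, missing = filter_and_drop(toy, ["CHECK", "SURVEYEASE", "NOT_A_COLUMN"])
print(reduced.shape, missing)
```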


In [17]:
# Identify columns that are common to all three DataFrames
common_cols = set(odf_2024.columns).intersection(odf_2023.columns).intersection(odf_2022.columns)

# Subset each DataFrame to these common columns
df_2024 = odf_2024[list(common_cols)].copy()
df_2023 = odf_2023[list(common_cols)].copy()
df_2022 = odf_2022[list(common_cols)].copy()

# Add a "year" column to each DataFrame
df_2024["year"] = 2024
df_2023["year"] = 2023
df_2022["year"] = 2022

# Concatenate all DataFrames
combined_df = pd.concat([df_2024, df_2023, df_2022], ignore_index=True)

# Print out some info to verify
print("Combined DataFrame shape:", combined_df.shape)
print(combined_df.head())

# Print column data types and other info (optional)
with pd.option_context('display.max_rows', None):
    print(combined_df.dtypes)

Combined DataFrame shape: (227889, 52)
   RESPONSEID            WEBFRAMEWANTTOWORKWITH  \
0           1                               NaN   
1           2  Express;Htmx;Node.js;React;Remix   
2           3                      ASP.NET CORE   
3           4      jQuery;Next.js;Node.js;React   
4           5                               NaN   

                       TOOLSTECHWANTTOWORKWITH  \
0                                          NaN   
1  Docker;Homebrew;Kubernetes;npm;Vite;Webpack   
2                                      MSBuild   
3                        Docker;Kubernetes;npm   
4                                     APT;Make   

                      OFFICESTACKASYNCHAVEWORKEDWITH                  ICORPM  \
0                                                NaN                     NaN   
1                                                NaN  Individual contributor   
2                                                NaN                     NaN   
3                                
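
Because `common_cols` is a `set`, the column order of the combined frame can differ between runs. A minimal sketch of an order-preserving variant, assuming a hypothetical helper `stack_common_columns` and made-up frames; it keeps the column order of the first frame passed in:

```python
import pandas as pd


def stack_common_columns(frames_by_year):
    """Stack frames on their shared columns, in the first frame's column order,
    adding a `year` column taken from the dictionary keys."""
    frames = list(frames_by_year.values())
    first, rest = frames[0], frames[1:]
    common = [c for c in first.columns if all(c in f.columns for f in rest)]
    parts = [f[common].assign(year=year) for year, f in frames_by_year.items()]
    return pd.concat(parts, ignore_index=True)


# Made-up frames: each has one column the other lacks
toy_2024 = pd.DataFrame({"RESPONSEID": [1, 2], "COUNTRY": ["a", "b"], "AISELECT": ["yes", "no"]})
toy_2023 = pd.DataFrame({"COUNTRY": ["c"], "RESPONSEID": [1], "BLOCKCHAIN": ["x"]})

stacked = stack_common_columns({2024: toy_2024, 2023: toy_2023})
print(stacked)
```

In this toy data `RESPONSEID` repeats across years, so the pair (`RESPONSEID`, `year`) is what identifies a row after stacking; whether the real survey files behave the same way should be checked before joining on `RESPONSEID` alone.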

In [18]:
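# CLEAN
def clean_and_split(val):
    """Lowercase strings; split semicolon-delimited strings into a list of trimmed items."""
    if not isinstance(val, str):
        return val  # leave NaN and numeric values untouched
    val = val.lower()
    if ';' in val:
        return [item.strip() for item in val.split(';')]
    return val.strip()

# DataFrame.map replaces the deprecated DataFrame.applymap (requires pandas >= 2.1)
odf_2024 = odf_2024.map(clean_and_split)
combined_df = combined_df.map(clean_and_split)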



In [19]:
with pd.option_context('display.max_columns', None,
                       'display.max_rows', None,
                       'display.width', 6000):
    print(odf_2024.head(20))


    RESPONSEID                                         MAINBRANCH                 AGE                                         EMPLOYMENT                            REMOTEWORK                                   CODINGACTIVITIES                                            EDLEVEL                                          LEARNCODE                                    LEARNCODEONLINE YEARSCODE YEARSCODEPRO                  DEVTYPE ORGSIZE                                            COUNTRY                             LANGUAGEHAVEWORKEDWITH                             LANGUAGEWANTTOWORKWITH                             DATABASEHAVEWORKEDWITH                        DATABASEWANTTOWORKWITH                             PLATFORMHAVEWORKEDWITH                             PLATFORMWANTTOWORKWITH                             WEBFRAMEHAVEWORKEDWITH                             WEBFRAMEWANTTOWORKWITH       EMBEDDEDHAVEWORKEDWITH EMBEDDEDWANTTOWORKWITH                             MISCTECHHAVEWORKEDWITH         
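
After cleaning, multi-select answers are stored as Python lists while single answers remain plain strings. One way to work with such mixed cells is `Series.explode`, which expands each list into one row per item and leaves scalars alone. The sketch below uses made-up answers and re-declares a function with the same behavior as `clean_and_split` so that it runs on its own:

```python
import pandas as pd


def clean_and_split(val):
    """Lowercase strings; split semicolon-delimited strings into trimmed lists."""
    if not isinstance(val, str):
        return val
    val = val.lower()
    if ";" in val:
        return [item.strip() for item in val.split(";")]
    return val.strip()


# Made-up multi-select answers, including a missing value and stray whitespace
demo = pd.DataFrame({
    "LANGUAGEHAVEWORKEDWITH": ["Python;SQL", "Rust", None, " Python ; Go"],
})

cleaned = demo["LANGUAGEHAVEWORKEDWITH"].map(clean_and_split)

# explode() turns each list into one row per item; value_counts() skips missing values
counts = cleaned.explode().value_counts()
print(counts)
```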