Our dataset comes from the Stack Overflow Developer Survey, an annual survey conducted by Stack Overflow, one of the largest online platforms for developers. Each year, the survey collects responses from tens of thousands of developers worldwide, providing insights into programming languages, tools, work environments, and emerging technologies.

For example, the 2024 survey gathered responses from over 65,000 developers across 185 countries between May 19 and June 20, 2024. The survey data is publicly available under the Open Database License (ODbL).

We obtained our dataset from the official survey site, survey.stackoverflow.co.

The 2024 Stack Overflow Developer Survey is relevant to our research because it reports on the skills, career motivations, frustrations, and job satisfaction of software developers worldwide. Our research focuses on improving hiring and retention strategies in the tech industry by identifying real-world indicators of developer success beyond traditional credentials.

The survey data supports our central argument: technical ability alone does not define a strong developer, and continuous learning, problem-solving, and workplace satisfaction are equally critical factors. The findings show that 82% of developers rely on self-directed learning, which underscores the importance of adaptability over static qualifications. The survey also highlights key retention challenges, such as technical debt (the number-one frustration) and burnout from excessive workloads, which reinforces the idea that hiring the right developers must go beyond simple coding assessments.

With this dataset, we can analyze trends that affect developer hiring and retention and derive data-driven strategies to help companies predict job success, reduce turnover, and build more resilient teams.

In [12]:
import requests
import zipfile
import io
import pandas as pd

# URL of the ZIP file containing the CSV
csv_url = 'https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip'

# Download the ZIP file content (the timeout avoids hanging on a stalled connection)
response = requests.get(csv_url, timeout=60)
if response.status_code == 200:
    zip_content = io.BytesIO(response.content)
    
    # Extract the ZIP file
    with zipfile.ZipFile(zip_content, 'r') as zip_ref:
        # Loop over the files in the ZIP archive
        for file_name in zip_ref.namelist():
            if file_name.endswith('.csv'):
                # Open the CSV file directly from the zip archive
                with zip_ref.open(file_name) as csv_file:
                    # Read CSV file into a DataFrame using pandas
                    df = pd.read_csv(csv_file)
                # Stop after processing the first CSV file
                break
else:
    print("Failed to retrieve the ZIP file.")
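
The cell above loads the first `.csv` entry that `namelist()` returns. If an archive holds more than one CSV (for example a results file next to a schema file), which one comes first depends on how the archive was built. The sketch below uses only the standard library and a toy in-memory archive to show one way of choosing a member by name. The helper `pick_csv_member` and the file names `survey_results_public.csv` and `survey_results_schema.csv` are assumptions for illustration, not something the cell above confirms.

```python
import io
import zipfile


def pick_csv_member(zip_bytes, preferred=None):
    """Return the name of the CSV member to load from a ZIP archive.

    Hypothetical helper: returns `preferred` if the archive contains it,
    otherwise the first .csv member, or None if there is no CSV at all.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    if preferred in csv_names:
        return preferred
    return csv_names[0] if csv_names else None


def build_demo_zip():
    """Build a small in-memory ZIP with two CSV files (toy data, not survey data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("survey_results_schema.csv", "qname,question\nAge,How old are you?\n")
        archive.writestr("survey_results_public.csv", "ResponseId,Age\n1,25-34\n")
        archive.writestr("README.txt", "toy archive")
    return buffer.getvalue()


demo = build_demo_zip()
first_csv = pick_csv_member(demo)
chosen_csv = pick_csv_member(demo, preferred="survey_results_public.csv")
print(first_csv)
print(chosen_csv)
```

The returned name could then be passed to `zip_ref.open(...)` in place of the loop variable `file_name`.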


In [5]:
import requests
import zipfile
import io
import pandas as pd

def download_and_extract_csv(zip_url):
    """
    Downloads a ZIP file from `zip_url`, extracts the first .csv file found,
    and returns it as a Pandas DataFrame.
    """
    # The timeout avoids hanging indefinitely on a stalled connection
    response = requests.get(zip_url, timeout=60)
    if response.status_code == 200:
        zip_content = io.BytesIO(response.content)
        
        with zipfile.ZipFile(zip_content, 'r') as zip_ref:
            for file_name in zip_ref.namelist():
                if file_name.endswith('.csv'):
                    with zip_ref.open(file_name) as csv_file:
                        df = pd.read_csv(csv_file)
                        return df
        # If there's no CSV in the ZIP, return None
        return None
    else:
        print(f"Failed to retrieve ZIP from {zip_url}")
        return None

# URLs of the ZIP files
csv_url_2024 = "https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip"
csv_url_2023 = "https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip"

# Download and load both datasets
df_2024 = download_and_extract_csv(csv_url_2024)
df_2023 = download_and_extract_csv(csv_url_2023)

# Check if both DataFrames were loaded successfully
if df_2024 is not None and df_2023 is not None:
    # Keep only columns that appear in both DataFrames
    common_cols = set(df_2024.columns).intersection(df_2023.columns)
    df_2024 = df_2024[list(common_cols)].copy()
    df_2023 = df_2023[list(common_cols)].copy()
    
    # Add a "year" column to each DataFrame
    df_2024["year"] = 2024
    df_2023["year"] = 2023
    
    # Concatenate the two DataFrames
    combined_df = pd.concat([df_2024, df_2023], ignore_index=True)
    
    # Print out some info to verify
    print("Combined DataFrame shape:", combined_df.shape)
    print(combined_df.head())
else:
    print("One or both of the DataFrames could not be loaded.")
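
The cell above builds `common_cols` as a set, so the column order of `combined_df` is arbitrary and can change between runs. Below is a minimal sketch of an order-preserving alternative, written with plain lists so it runs without the survey files. `ordered_common_columns` is a hypothetical helper and the column lists are toy values.

```python
def ordered_common_columns(cols_a, cols_b):
    """Return the columns present in both inputs, in the order of cols_a."""
    in_b = set(cols_b)  # set used only for fast membership tests
    return [col for col in cols_a if col in in_b]


# Toy column lists standing in for df_2024.columns and df_2023.columns
cols_2024 = ["ResponseId", "Age", "AISelect", "JobSat"]
cols_2023 = ["Age", "ResponseId", "AISelect", "Industry"]

common = ordered_common_columns(cols_2024, cols_2023)
print(common)  # ['ResponseId', 'Age', 'AISelect']
```

With the DataFrames from the cell above, the same list could be used as `df_2024[common]` and `df_2023[common]`, which keeps the 2024 column order in both. The printed output that follows comes from the original cell, not from this sketch.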

Combined DataFrame shape: (154621, 72)
                            MiscTechHaveWorkedWith  \
0                                              NaN   
1                                              NaN   
2  .NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI   
3                     NumPy;Pandas;Ruff;TensorFlow   
4                                              NaN   

                              LanguageWantToWorkWith                 AISelect  \
0                                                NaN                      Yes   
1  Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...  No, and I don't plan to   
2                                                 C#  No, and I don't plan to   
3  HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...                      Yes   
4                 C++;HTML/CSS;JavaScript;Lua;Python  No, and I don't plan to   

  Industry OpSysProfessional use  \
0      NaN                   NaN   
1      NaN                 MacOS   
2      NaN               Windows   
3    