# Analyzing Stack Overflow Developer Survey Data 2024

This notebook is structured according to CRISP-DM:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

## 1. Business Understanding

This notebook analyses the data from the 2024 Stack Overflow Developer Survey.

The main questions I want to answer are:

**1. Which technologies are the most popular and widely used?**

- Are there differences depending on the region?
- Are there differences based on the experience of developers or developer groups?
- Which technologies are currently being used the most, and which ones would the participants like to work with in the future?

**2. Which technologies currently offer the highest salaries?**

- Are there differences depending on the country/region?

**3. What developer profiles exist, and how do they differ from each other?**

## 2. Data Understanding

In [25]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
# read csv data
df = pd.read_csv("survey_results_public.csv")
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


In [27]:
# report the number of rows and columns
print(f"The data set has {df.shape[0]} rows and {df.shape[1]} columns")

The data set has 65437 rows and 114 columns


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Columns: 114 entries, ResponseId to JobSat
dtypes: float64(13), int64(1), object(100)
memory usage: 56.9+ MB


In [29]:
df.describe(include="all")

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
count,65437.0,65437,65437,65437,54806,65437,54466,60784,60488,49237,...,29450.0,29448.0,29456.0,29456.0,29450.0,29445.0,56182,56238,23435.0,29126.0
unique,,5,8,110,3,1,118,8,418,10853,...,,,,,,,3,3,,
top,,I am a developer by profession,25-34 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Appropriate in length,Easy,,
freq,,50207,23911,39041,23015,65437,9993,24942,3674,603,...,,,,,,,38767,30071,,
mean,32719.0,,,,,,,,,,...,24.343232,22.96522,20.278165,16.169432,10.955713,9.953948,,,86155.29,6.935041
std,18890.179119,,,,,,,,,,...,27.08936,27.01774,26.10811,24.845032,22.906263,21.775652,,,186757.0,2.088259
min,1.0,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,,,1.0,0.0
25%,16360.0,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,,,32712.0,6.0
50%,32719.0,,,,,,,,,,...,20.0,15.0,10.0,5.0,0.0,0.0,,,65000.0,7.0
75%,49078.0,,,,,,,,,,...,30.0,30.0,25.0,20.0,10.0,10.0,,,107971.5,8.0


In [30]:
# missing values
missing_values = pd.DataFrame(df.isna().sum()).reset_index()
missing_values.columns = ["column", "count_missing_vals"]
missing_values["percent_missing"] = missing_values["count_missing_vals"]/df.shape[0]*100
missing_values.sort_values("percent_missing", ascending=False)

Unnamed: 0,column,count_missing_vals,percent_missing
75,AINextMuch less integrated,64289,98.245641
74,AINextLess integrated,63082,96.401119
72,AINextNo change,52939,80.900714
71,AINextMuch more integrated,51999,79.464217
36,EmbeddedAdmired,48704,74.428840
...,...,...,...
1,MainBranch,0,0.000000
5,Check,0,0.000000
3,Employment,0,0.000000
2,Age,0,0.000000
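
The same kind of table can also be built from `isna().mean()`, which gives the share of missing entries per column directly. The following is a minimal sketch on a small made-up frame (`toy`) rather than the survey data; the helper name `missing_summary` is hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, sorted descending."""
    summary = pd.DataFrame({
        "count_missing_vals": frame.isna().sum(),
        "percent_missing": frame.isna().mean() * 100,
    })
    return (
        summary.rename_axis("column")
        .reset_index()
        .sort_values("percent_missing", ascending=False, ignore_index=True)
    )


# toy data standing in for the survey frame (values are made up)
toy = pd.DataFrame({
    "Age": ["18-24 years old", "25-34 years old", "35-44 years old", "45-54 years old"],
    "CompTotal": [np.nan, 50000.0, np.nan, np.nan],
    "Country": ["Norway", None, "Canada", "Canada"],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

Sorting with `ignore_index=True` renumbers the rows, so the column with the most gaps sits at position 0.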


## 3. Data Preparation

In [31]:
# make copy of original dataframe
df_clean = df.copy()

In [32]:
# keep only relevant columns (personal data of the participants and language/tech tool data)
columns_to_keep = [
    'Age',
    'Country',
    'YearsCodePro',
    'DevType',
    'EdLevel',
    'CompTotal',
    'Currency',
    'LanguageHaveWorkedWith',
    'LanguageWantToWorkWith',
    'DatabaseHaveWorkedWith',
    'DatabaseWantToWorkWith',
    'PlatformHaveWorkedWith',
    'PlatformWantToWorkWith',
    'WebframeHaveWorkedWith',
    'WebframeWantToWorkWith',
    'MiscTechHaveWorkedWith',
    'MiscTechWantToWorkWith',
    'ToolsTechHaveWorkedWith',
    'ToolsTechWantToWorkWith',
    'AISearchDevHaveWorkedWith',
    'AISearchDevWantToWorkWith',
    'AISelect',
    'AISent',
    'AIBen',
    'AIAcc',
    'AIComplex'
]

In [33]:
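# select the columns and take an explicit copy, so that later column
# assignments on df_clean do not trigger a SettingWithCopyWarning
df_clean = df[columns_to_keep].copy()
df_clean.shape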

(65437, 26)
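
Selecting by a hand-written list raises a `KeyError` as soon as one name is misspelled or absent from the data. A possible guard, sketched on a toy frame, is to compare the list against the available columns before selecting. The names `toy`, `wanted`, `absent` and `present` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# toy frame and wish list standing in for df and columns_to_keep
toy = pd.DataFrame({
    "Age": ["18-24 years old"],
    "Country": ["Norway"],
    "Check": ["Apples"],
})
wanted = ["Age", "Country", "YearsCodePro"]

# names that are requested but absent from the frame
absent = [col for col in wanted if col not in toy.columns]

# select only the names that exist; .copy() makes the subset independent
present = [col for col in wanted if col in toy.columns]
subset = toy[present].copy()
print(absent, subset.shape)
```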

In [34]:
df_clean.head()

Unnamed: 0,Age,Country,YearsCodePro,DevType,EdLevel,CompTotal,Currency,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,...,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,AISearchDevHaveWorkedWith,AISearchDevWantToWorkWith,AISelect,AISent,AIBen,AIAcc,AIComplex
0,Under 18 years old,United States of America,,,Primary/elementary school,,,,,,...,,,,,,Yes,Very favorable,Increase productivity,,
1,35-44 years old,United Kingdom of Great Britain and Northern I...,17.0,"Developer, full-stack","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",,,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Dynamodb;MongoDB;PostgreSQL,...,,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,,,"No, and I don't plan to",,,,
2,45-54 years old,United Kingdom of Great Britain and Northern I...,27.0,Developer Experience,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",,,C#,C#,Firebase Realtime Database,...,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,MSBuild,MSBuild,,,"No, and I don't plan to",,,,
3,18-24 years old,Canada,,"Developer, full-stack",Some college/university study without earning ...,,,C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,MongoDB;MySQL;PostgreSQL;SQLite,...,,Docker;npm;Pip,Docker;Kubernetes;npm,,,Yes,Very favorable,Increase productivity;Greater efficiency;Impro...,Somewhat trust,Bad at handling complex tasks
4,18-24 years old,Norway,,"Developer, full-stack","Secondary school (e.g. American high school, G...",,,C++;HTML/CSS;JavaScript;Lua;Python;Rust,C++;HTML/CSS;JavaScript;Lua;Python,PostgreSQL;SQLite,...,,APT;Make;npm,APT;Make,,,"No, and I don't plan to",,,,


In [35]:
df_clean.dtypes

Age                           object
Country                       object
YearsCodePro                  object
DevType                       object
EdLevel                       object
CompTotal                    float64
Currency                      object
LanguageHaveWorkedWith        object
LanguageWantToWorkWith        object
DatabaseHaveWorkedWith        object
DatabaseWantToWorkWith        object
PlatformHaveWorkedWith        object
PlatformWantToWorkWith        object
WebframeHaveWorkedWith        object
WebframeWantToWorkWith        object
MiscTechHaveWorkedWith        object
MiscTechWantToWorkWith        object
ToolsTechHaveWorkedWith       object
ToolsTechWantToWorkWith       object
AISearchDevHaveWorkedWith     object
AISearchDevWantToWorkWith     object
AISelect                      object
AISent                        object
AIBen                         object
AIAcc                         object
AIComplex                     object
dtype: object

In [36]:
# age as categorical variable
df_clean['Age'] = df_clean['Age'].astype('category')


In [37]:
df_clean['Age'].unique()

['Under 18 years old', '35-44 years old', '45-54 years old', '18-24 years old', '25-34 years old', '55-64 years old', 'Prefer not to say', '65 years or older']
Categories (8, object): ['18-24 years old', '25-34 years old', '35-44 years old', '45-54 years old', '55-64 years old', '65 years or older', 'Prefer not to say', 'Under 18 years old']
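
The categories are stored in alphabetical order, which places 'Under 18 years old' after '65 years or older'. If the age brackets should sort and plot in their natural order, an ordered categorical is one option. Below is a minimal sketch on a toy series. The list `age_order` is an assumption built from the unique values shown above, and putting 'Prefer not to say' last is a judgment call.

```python
import pandas as pd

# assumed natural ordering of the brackets shown in the output above;
# where to put 'Prefer not to say' is a judgment call
age_order = [
    "Under 18 years old",
    "18-24 years old",
    "25-34 years old",
    "35-44 years old",
    "45-54 years old",
    "55-64 years old",
    "65 years or older",
    "Prefer not to say",
]
age_dtype = pd.CategoricalDtype(categories=age_order, ordered=True)

# toy series standing in for df_clean['Age']
ages = pd.Series(["35-44 years old", "Under 18 years old", "65 years or older"])
ages = ages.astype(age_dtype)

# sorting now follows the bracket order instead of the alphabet
sorted_ages = ages.sort_values().tolist()
print(sorted_ages)
```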

In [38]:
df_clean["YearsCodePro"].unique()

array([nan, '17', '27', '7', '11', '25', '12', '10', '3',
       'Less than 1 year', '18', '37', '15', '20', '6', '2', '16', '8',
       '14', '4', '45', '1', '24', '29', '5', '30', '26', '9', '33', '13',
       '35', '23', '22', '31', '19', '21', '28', '34', '32', '40', '50',
       '39', '44', '42', '41', '36', '38', 'More than 50 years', '43',
       '47', '48', '46', '49'], dtype=object)

In [39]:
# convert YearsCodePro to numeric (replace non-numeric values)
df_clean['YearsCodePro'] = df_clean['YearsCodePro'].replace({
    'Less than 1 year': 0.5,
    'More than 50 years': 51
})
df_clean['YearsCodePro'] = pd.to_numeric(df_clean['YearsCodePro'], errors='coerce')


In [40]:
df_clean.dtypes

Age                          category
Country                        object
YearsCodePro                  float64
DevType                        object
EdLevel                        object
CompTotal                     float64
Currency                       object
LanguageHaveWorkedWith         object
LanguageWantToWorkWith         object
DatabaseHaveWorkedWith         object
DatabaseWantToWorkWith         object
PlatformHaveWorkedWith         object
PlatformWantToWorkWith         object
WebframeHaveWorkedWith         object
WebframeWantToWorkWith         object
MiscTechHaveWorkedWith         object
MiscTechWantToWorkWith         object
ToolsTechHaveWorkedWith        object
ToolsTechWantToWorkWith        object
AISearchDevHaveWorkedWith      object
AISearchDevWantToWorkWith      object
AISelect                       object
AISent                         object
AIBen                          object
AIAcc                          object
AIComplex                      object
dtype: object

In [41]:
df_clean.head()

Unnamed: 0,Age,Country,YearsCodePro,DevType,EdLevel,CompTotal,Currency,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,...,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,AISearchDevHaveWorkedWith,AISearchDevWantToWorkWith,AISelect,AISent,AIBen,AIAcc,AIComplex
0,Under 18 years old,United States of America,,,Primary/elementary school,,,,,,...,,,,,,Yes,Very favorable,Increase productivity,,
1,35-44 years old,United Kingdom of Great Britain and Northern I...,17.0,"Developer, full-stack","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",,,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Dynamodb;MongoDB;PostgreSQL,...,,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,,,"No, and I don't plan to",,,,
2,45-54 years old,United Kingdom of Great Britain and Northern I...,27.0,Developer Experience,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",,,C#,C#,Firebase Realtime Database,...,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,MSBuild,MSBuild,,,"No, and I don't plan to",,,,
3,18-24 years old,Canada,,"Developer, full-stack",Some college/university study without earning ...,,,C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,MongoDB;MySQL;PostgreSQL;SQLite,...,,Docker;npm;Pip,Docker;Kubernetes;npm,,,Yes,Very favorable,Increase productivity;Greater efficiency;Impro...,Somewhat trust,Bad at handling complex tasks
4,18-24 years old,Norway,,"Developer, full-stack","Secondary school (e.g. American high school, G...",,,C++;HTML/CSS;JavaScript;Lua;Python;Rust,C++;HTML/CSS;JavaScript;Lua;Python,PostgreSQL;SQLite,...,,APT;Make;npm,APT;Make,,,"No, and I don't plan to",,,,


In [42]:
# identify multi-select columns, i.e. columns where at least one cell
# contains several answers separated by ";"
multiple_value_cols = [
    col for col in df_clean.columns
    if df_clean[col].astype(str).str.contains(";", regex=False).any()
]

# split these columns into lists of answers (missing values stay NaN)
for col in multiple_value_cols:
    df_clean[col] = df_clean[col].str.split(";")


In [43]:
df_clean.head()

Unnamed: 0,Age,Country,YearsCodePro,DevType,EdLevel,CompTotal,Currency,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,...,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,AISearchDevHaveWorkedWith,AISearchDevWantToWorkWith,AISelect,AISent,AIBen,AIAcc,AIComplex
0,Under 18 years old,United States of America,,,Primary/elementary school,,,,,,...,,,,,,Yes,Very favorable,[Increase productivity],,
1,35-44 years old,United Kingdom of Great Britain and Northern I...,17.0,"Developer, full-stack","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",,,"[Bash/Shell (all shells), Go, HTML/CSS, Java, ...","[Bash/Shell (all shells), Go, HTML/CSS, Java, ...","[Dynamodb, MongoDB, PostgreSQL]",...,,"[Docker, Homebrew, Kubernetes, npm, Vite, Webp...","[Docker, Homebrew, Kubernetes, npm, Vite, Webp...",,,"No, and I don't plan to",,,,
2,45-54 years old,United Kingdom of Great Britain and Northern I...,27.0,Developer Experience,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",,,[C#],[C#],[Firebase Realtime Database],...,"[.NET (5+) , .NET Framework (1.0 - 4.8), .NET ...",[MSBuild],[MSBuild],,,"No, and I don't plan to",,,,
3,18-24 years old,Canada,,"Developer, full-stack",Some college/university study without earning ...,,,"[C, C++, HTML/CSS, Java, JavaScript, PHP, Powe...","[HTML/CSS, Java, JavaScript, PowerShell, Pytho...","[MongoDB, MySQL, PostgreSQL, SQLite]",...,,"[Docker, npm, Pip]","[Docker, Kubernetes, npm]",,,Yes,Very favorable,"[Increase productivity, Greater efficiency, Im...",Somewhat trust,Bad at handling complex tasks
4,18-24 years old,Norway,,"Developer, full-stack","Secondary school (e.g. American high school, G...",,,"[C++, HTML/CSS, JavaScript, Lua, Python, Rust]","[C++, HTML/CSS, JavaScript, Lua, Python]","[PostgreSQL, SQLite]",...,,"[APT, Make, npm]","[APT, Make]",,,"No, and I don't plan to",,,,
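
With the multi-select answers stored as lists, one way to count how often each technology is named is `explode()` followed by `value_counts()`. The sketch below uses a made-up frame that reuses one of the column names above. The helper `count_mentions` is hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def count_mentions(frame: pd.DataFrame, col: str) -> pd.Series:
    """Count how many respondents listed each item in a list-valued column."""
    # explode() gives each list element its own row; missing answers stay NaN
    exploded = frame[col].explode().dropna()
    # strip stray whitespace (e.g. '.NET (5+) ' in the output above)
    return exploded.str.strip().value_counts()


# toy data standing in for df_clean (values are made up)
toy = pd.DataFrame({
    "LanguageHaveWorkedWith": [
        ["Python", "SQL"],
        ["Python", "Rust"],
        np.nan,
        ["Python", "SQL", "C#"],
    ]
})
language_counts = count_mentions(toy, "LanguageHaveWorkedWith")
print(language_counts)
```

Dividing the counts by the number of respondents who answered the question would turn them into shares, which are easier to compare across regions or experience groups.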


In [50]:
# salaries are reported in different currencies and have to be converted to USD
# identify all currencies used by extracting the three-letter currency code
df_clean['Currency'] = df_clean['Currency'].astype(str).str.extract(r'([A-Z]{3})', expand=False)
currencies = df_clean['Currency'].unique()
print(currencies)

[nan 'PKR' 'EUR' 'USD' 'BRL' 'GBP' 'RON' 'INR' 'CHF' 'TRY' 'RUB' 'ZAR'
 'CZK' 'CAD' 'IRR' 'MXN' 'UAH' 'DOP' 'KMF' 'RSD' 'PEN' 'MAD' 'GEL' 'PLN'
 'SAR' 'SEK' 'BGN' 'KZT' 'SGD' 'JOD' 'JPY' 'NOK' 'ILS' 'DKK' 'THB' 'RWF'
 'HUF' 'BDT' 'IDR' 'BAM' 'PHP' 'XOF' 'DZD' 'TND' 'MYR' 'BHD' 'ARS' 'NIO'
 'AFN' 'UYU' 'BYN' 'COP' 'ALL' 'AUD' 'UZS' 'NZD' 'MVR' 'GHS' 'AED' 'NGN'
 'FJD' 'GTQ' 'UGX' 'CRC' 'MUR' 'KES' 'EGP' 'TWD' 'AMD' 'KRW' 'CLP' 'ISK'
 'HNL' 'HKD' 'CNY' 'VND' 'BSD' 'LKR' 'BTN' 'MNT' 'KHR' 'NPR' 'BOB' 'ETB'
 'AOA' 'MKD' 'SYP' 'NAD' 'ANG' 'TJS' 'BIF' 'JMD' 'TTD' 'SLL' 'SRD' 'GYD'
 'KGS' 'ZMW' 'MDL' 'OMR' 'CUP' 'XPF' 'KYD' 'TZS' 'KWD' 'TMT' 'QAR' 'YER'
 'MWK' 'IQD' 'IMP' 'KPW' 'XAF' 'MGA' 'PYG' 'ERN' 'MMK' 'SHP' 'MZN' 'AZN'
 'LYD' 'MOP' 'LBP' 'BND' 'VES' 'SOS' 'CDF' 'XDR' 'MRU' 'WST' 'SDG' 'XCD'
 'FKP' 'BWP' 'GGP' 'CVE' 'GIP' 'SZL' 'AWG' 'BBD' 'BMD']
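
Converting `CompTotal` to USD then comes down to mapping each currency code to an exchange rate and multiplying. Below is a minimal sketch, assuming a lookup table `usd_per_unit`. The rates in it are placeholders, not real exchange rates, and the toy frame stands in for `df_clean`. The `describe()` output above also lists a `ConvertedCompYearly` column in the raw data, which may already hold a converted figure and could be worth checking before converting by hand.

```python
import numpy as np
import pandas as pd

# placeholder rates (USD per one unit of currency); NOT real exchange rates
usd_per_unit = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}

# toy data standing in for df_clean (values are made up)
toy = pd.DataFrame({
    "CompTotal": [50000.0, 40000.0, 30000.0, np.nan, 1000.0],
    "Currency": ["USD", "EUR", "GBP", "EUR", "PKR"],
})

# map() yields NaN for codes without a rate, so those rows stay unconverted
rate = toy["Currency"].map(usd_per_unit)
toy["CompTotalUSD"] = toy["CompTotal"] * rate

# list currencies that still need a rate
codes_without_rate = sorted(set(toy["Currency"].dropna()) - set(usd_per_unit))
print(toy)
print(codes_without_rate)
```

Leaving unknown currencies as NaN rather than guessing a rate keeps unconverted salaries out of any later averages.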



## 4. Data Analysis & Modeling

## 5. Evaluation

## 6. Deployment / Presentation