## References

- Dataset: https://www.kaggle.com/code/joebeachcapital/stack-overflow-survey-eda-starter/notebook

- Existing EDA: https://survey.stackoverflow.co/2024/ 

- EDA for categorical variables: https://www.kaggle.com/code/nextbigwhat/eda-for-categorical-variables-a-beginner-s-way 

- Feature selection for categorical data: https://machinelearningmastery.com/feature-selection-with-categorical-data/ 

## Notes 
### More important categories
- TODO Recreate bar chart of response categories

Developer profile:
- Educational attainment, Learning to code
- Years coding (filter by country)
- Years coding professionally
- Developer type
- Geography
- Age (if we stratify by years of coding experience, do we still see a correlation between age and income?)

Technology:
- Languages
- Databases
- Cloud platforms
- Web frameworks and technologies
- Embedded technologies
- Other frameworks and libraries
- Other tools (e.g., Docker)
- IDEs

### Less important categories
- Synchronous tools (e.g., Teams)
- Asynchronous tools (e.g., Jira)


## Load data and import packages

In [2]:
%load_ext autoreload
%autoreload 2

import os
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.display import display, Markdown

In [3]:
display(Markdown("-------------\n### Files detected from dataset"))

# Load data from: https://www.kaggle.com/code/joebeachcapital/stack-overflow-survey-eda-starter/notebook 
for dirname, _, filenames in os.walk('stack-overflow-developer-survey-2024/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


-------------
### Files detected from dataset

stack-overflow-developer-survey-2024/survey_results_public.csv
stack-overflow-developer-survey-2024/survey_results_schema.csv
stack-overflow-developer-survey-2024/2024 Developer Survey.pdf


In [4]:
df = pd.read_csv('stack-overflow-developer-survey-2024/survey_results_public.csv')

# Visualize the first few rows of the dataset (default n = 5)
n_rows, n_cols = df.shape  # report the real dimensions instead of hard-coding them
display(Markdown(f"-------------\n### Top of the Dataset: `df.head()`\nNote: The dataset contains {n_rows:,} rows and {n_cols} columns."))
display(df.head())

# This method provides a concise summary of the dataset, including the number of non-null values in each column.
display(Markdown("-------------\n### Detailed Information: `df.info()`\n**Note: We have some missing values in the dataset**"))
display(df.info())

# Descriptive statistics for numerical columns
display(Markdown("-------------\n### Descriptive Statistics: `df.describe()`\n"))
display(df.describe())
display(Markdown("-------------"))

-------------
### Top of the Dataset: `df.head()`
Note: The dataset contains 65,437 rows and 114 columns.

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


-------------
### Detailed Information: `df.info()`
**Note: We have some missing values in the dataset**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Columns: 114 entries, ResponseId to JobSat
dtypes: float64(13), int64(1), object(100)
memory usage: 56.9+ MB


None

-------------
### Descriptive Statistics: `df.describe()`


Unnamed: 0,ResponseId,CompTotal,WorkExp,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,ConvertedCompYearly,JobSat
count,65437.0,33740.0,29658.0,29324.0,29393.0,29411.0,29450.0,29448.0,29456.0,29456.0,29450.0,29445.0,23435.0,29126.0
mean,32719.0,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,10.955713,9.953948,86155.29,6.935041
std,18890.179119,5.444117e+147,9.168709,25.966221,18.422661,21.833836,27.08936,27.01774,26.10811,24.845032,22.906263,21.775652,186757.0,2.088259
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,16360.0,60000.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32712.0,6.0
50%,32719.0,110000.0,9.0,10.0,0.0,0.0,20.0,15.0,10.0,5.0,0.0,0.0,65000.0,7.0
75%,49078.0,250000.0,16.0,22.0,5.0,10.0,30.0,30.0,25.0,20.0,10.0,10.0,107971.5,8.0
max,65437.0,1e+150,50.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,16256600.0,10.0


-------------

In [7]:
# Display all column names, data types, and unique values for categorical columns
display(Markdown("-------------\n### All Columns in Dataset"))
print("\nTotal number of columns:", len(df.columns))
print("\nColumns, data types, and unique values:")
for col in df.columns:
    dtype = df[col].dtype
    if dtype in ['object', 'category']:
        n_unique = df[col].nunique()
        print(f"- {col}: {dtype} ({n_unique} unique values)")
    else:
        print(f"- {col}: {dtype}")


-------------
### All Columns in Dataset


Total number of columns: 114

Columns, data types, and unique values:
- ResponseId: int64
- MainBranch: object (5 unique values)
- Age: object (8 unique values)
- Employment: object (110 unique values)
- RemoteWork: object (3 unique values)
- Check: object (1 unique values)
- CodingActivities: object (118 unique values)
- EdLevel: object (8 unique values)
- LearnCode: object (418 unique values)
- LearnCodeOnline: object (10853 unique values)
- TechDoc: object (113 unique values)
- YearsCode: object (52 unique values)
- YearsCodePro: object (52 unique values)
- DevType: object (34 unique values)
- OrgSize: object (10 unique values)
- PurchaseInfluence: object (3 unique values)
- BuyNewTool: object (215 unique values)
- BuildvsBuy: object (3 unique values)
- TechEndorse: object (386 unique values)
- Country: object (185 unique values)
- Currency: object (142 unique values)
- CompTotal: float64
- LanguageHaveWorkedWith: object (23864 unique values)
- LanguageWantToWorkWith: object (22769
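
The unique-value counts above are inflated for multi-select questions: every distinct semicolon-separated combination is counted as its own value, which is why a column such as `LanguageHaveWorkedWith` reports tens of thousands of "unique values". A minimal sketch of counting the individual options instead, assuming `;` is the separator (as the sample rows suggest). The helper name `count_distinct_options` and the toy data are hypothetical, not part of the dataset.

```python
import pandas as pd


def count_distinct_options(series: pd.Series, sep: str = ";") -> int:
    """Count distinct individual options in a delimiter-separated multi-select column."""
    # Split each answer into its options, put one option per row, then count uniques.
    options = series.dropna().str.split(sep).explode().str.strip()
    return options.nunique()


# Toy stand-in for a multi-select column (values are made up for illustration).
toy = pd.Series(["Python;SQL", "SQL;Python", "Python;Rust", "Python", None])

n_combinations = toy.nunique()           # distinct answer strings
n_options = count_distinct_options(toy)  # distinct individual options
print(f"{n_combinations} combinations, {n_options} individual options")
```

On the real data, the same helper could be called as `count_distinct_options(df['LanguageHaveWorkedWith'])`.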

## Bar charts

In [8]:
'''
Developer profile:
- Educational attainment, Learning to code
- Years coding (filter by country)
- Years coding professionally
- Developer type
- Geography
- Age (if we stratify by years of coding experience, do we still see a correlation between age and income?)

Technology:
- Languages
- Databases
- Cloud platforms
- Web frameworks and technologies
- Embedded technologies
- Other frameworks and libraries
- Other tools (e.g., Docker)
'''
# Create bar charts for various developer profile features
import plotly.express as px

# Educational Attainment
fig = px.bar(df['EdLevel'].value_counts(), 
             title='Distribution of Educational Attainment',
             labels={'value': 'Count', 'index': 'Education Level'})
fig.show()

# Years Coding Experience
fig = px.bar(df['YearsCode'].value_counts().sort_index(), 
             title='Distribution of Years Coding Experience',
             labels={'value': 'Count', 'index': 'Years of Experience'})
fig.show()

# Years Coding Professionally 
fig = px.bar(df['YearsCodePro'].value_counts().sort_index(),
             title='Distribution of Professional Coding Experience',
             labels={'value': 'Count', 'index': 'Years of Professional Experience'})
fig.show()

# Developer Type
fig = px.bar(df['DevType'].value_counts(),
             title='Distribution of Developer Types',
             labels={'value': 'Count', 'index': 'Developer Type'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

# Geography (Top 20 countries)
fig = px.bar(df['Country'].value_counts().head(20),
             title='Top 20 Countries of Respondents',
             labels={'value': 'Count', 'index': 'Country'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

# Age Distribution
fig = px.bar(df['Age'].value_counts(),
             title='Age Distribution of Respondents',
             labels={'value': 'Count', 'index': 'Age Group'})
fig.show()

# Technology Usage

# Programming Languages
fig = px.bar(df['LanguageHaveWorkedWith'].str.get_dummies(sep=';').sum().sort_values(ascending=False),
             title='Programming Languages Used',
             labels={'value': 'Count', 'index': 'Language'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

# Databases
fig = px.bar(df['DatabaseHaveWorkedWith'].str.get_dummies(sep=';').sum().sort_values(ascending=False),
             title='Databases Used',
             labels={'value': 'Count', 'index': 'Database'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

# Cloud Platforms
fig = px.bar(df['PlatformHaveWorkedWith'].str.get_dummies(sep=';').sum().sort_values(ascending=False),
             title='Cloud Platforms Used',
             labels={'value': 'Count', 'index': 'Platform'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()

# Web Frameworks
fig = px.bar(df['WebframeHaveWorkedWith'].str.get_dummies(sep=';').sum().sort_values(ascending=False),
             title='Web Frameworks Used',
             labels={'value': 'Count', 'index': 'Framework'})
fig.update_layout(xaxis_tickangle=-45)
fig.show()
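
One caveat for the two years-of-experience charts: `YearsCode` and `YearsCodePro` are `object` columns, so `sort_index()` orders the answers as strings and `"10"` lands before `"2"`. A minimal sketch of converting the answers to numbers before sorting. The two text labels in `SPECIAL_YEARS` and the numeric values assigned to them are assumptions; check them against `df['YearsCode'].unique()` before relying on this. The helper name `years_to_numeric` is hypothetical.

```python
import pandas as pd

# Assumed text answers and stand-in numeric values; verify against the real column.
SPECIAL_YEARS = {"Less than 1 year": 0.5, "More than 50 years": 51.0}


def years_to_numeric(series: pd.Series) -> pd.Series:
    """Convert year answers stored as strings to floats so they sort numerically."""
    mapped = series.map(lambda value: SPECIAL_YEARS.get(value, value))
    # Anything that still is not a number becomes NaN rather than raising.
    return pd.to_numeric(mapped, errors="coerce")


# Toy data standing in for df['YearsCode'].
toy_years = pd.Series(["10", "2", "Less than 1 year", "More than 50 years", "2", None])
counts = years_to_numeric(toy_years).value_counts().sort_index()
print(counts)
```

The resulting `counts` can be passed to `px.bar` in place of `df['YearsCode'].value_counts().sort_index()`.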


## Feature selection over categorical variables

In [None]:
# - [ ] TODO Treat NA values as their own category
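
A minimal sketch of the TODO above, assuming single-choice categorical columns and an explicit `"Missing"` label (both the label and the helper name `na_as_category` are hypothetical). Keeping non-response as its own level lets later feature-selection steps see it rather than silently dropping those rows.

```python
import pandas as pd


def na_as_category(df: pd.DataFrame, columns: list, label: str = "Missing") -> pd.DataFrame:
    """Return a copy of df where NA in the given columns becomes an explicit category."""
    out = df.copy()
    for col in columns:
        # Keep non-missing answers, substitute the label elsewhere, then store as category.
        filled = out[col].where(out[col].notna(), label)
        out[col] = filled.astype("category")
    return out


# Toy data standing in for a column such as RemoteWork (values are made up).
toy = pd.DataFrame({"RemoteWork": ["Remote", None, "Hybrid", None]})
encoded = na_as_category(toy, ["RemoteWork"])
category_counts = encoded["RemoteWork"].value_counts()
print(category_counts)
```

Multi-select columns (the `;`-separated ones) would need to be split first; this sketch does not handle them.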

## Covariance matrix

## Dropping rows with no income data?

In [13]:
# Filter rows where CompTotal is NA
df_no_income = df[df['CompTotal'].isna()]
print(f"Number of responses with no income data: {len(df_no_income)}")
print(f"Percentage of total responses: {(len(df_no_income) / len(df) * 100):.2f}%")

# Display the rows with no income data (pandas truncates the printed output)
display(df_no_income[['Employment', 'Age', 'CompTotal']])

# Count rows with and without CompTotal data (reusing the filter from above)
rows_without_income = len(df_no_income)
rows_with_income = len(df) - rows_without_income

print("\nComparison of rows with/without income data:")
print(f"Rows with income data: {rows_with_income:,}")
print(f"Rows without income data: {rows_without_income:,}")
print(f"Ratio (with:without): {rows_with_income/rows_without_income:.2f}")




Number of responses with no income data: 31697
Percentage of total responses: 48.44%


Unnamed: 0,Employment,Age,CompTotal
0,"Employed, full-time",Under 18 years old,
1,"Employed, full-time",35-44 years old,
2,"Employed, full-time",45-54 years old,
3,"Student, full-time",18-24 years old,
4,"Student, full-time",18-24 years old,
...,...,...,...
65432,"Employed, full-time",18-24 years old,
65433,"Employed, full-time",25-34 years old,
65434,"Employed, full-time",25-34 years old,
65435,"Employed, full-time",18-24 years old,



Comparison of rows with/without income data:
Rows with income data: 33,740
Rows without income data: 31,697
Ratio (with:without): 1.06


# TODO show how sparse different columns are
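
A minimal sketch for this TODO: the share of missing values per column, sorted so the sparsest columns come first. The helper name `missing_share` and the toy frame are hypothetical.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, sparsest column first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy data with made-up values, standing in for the survey DataFrame.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Age": ["18-24 years old", "25-34 years old", None, "35-44 years old"],
    "CompTotal": [None, 50000.0, None, None],
})
sparsity = missing_share(toy)
print(sparsity)
```

On the real data, `missing_share(df).head(30)` could be passed to `px.bar` to chart the sparsest columns in the same style as the bar charts above.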