# Data description

---


This project aims to provide a data-driven solution that helps students understand the relationships between different job functions and their associated technologies.

**Data Source**:

- The project uses the Stack Overflow Developer Survey 2023 dataset, which provides rich information about the skills and technologies associated with various IT jobs. This dataset serves as the foundation for the data-driven analysis and recommendations provided by SkillMap.

- Data link: https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2023-developers-survey?select=survey_results_public.csv


## Data schema


1. **ResponseId (or equivalent identifier):** Unique identifier for each survey response.

2. **MainBranch:** Describes the respondent's current role or branch (e.g., developer by profession).

3. **Age:** Provides information about the respondent's age.

4. **Employment:** Describes the respondent's current employment status.

5. **RemoteWork:** Indicates the respondent's current work situation regarding remote work.

6. **CodingActivities:** Describes the code-related activities the respondent engages in outside of work.

7. **EdLevel:** Specifies the highest level of formal education completed by the respondent.

8. **LearnCode, LearnCodeOnline, LearnCodeCoursesCert:** Describes how the respondent learns to code, including online resources and courses.

9. **YearsCode, YearsCodePro:** Indicate the respondent's total years of coding experience and the years spent coding professionally, respectively.

10. **DevType:** Describes the respondent's current job role or function.

11. **OrgSize:** Specifies the size of the organization where the respondent works.

12. **PurchaseInfluence:** Indicates the level of influence the respondent has over new technology purchases in their organization.

13. **TechList:** Describes the respondent's approach when thinking about new technology purchases.

14. **BuyNewTool:** Describes how the respondent discovers and researches new tools or software.

15. **Country:** Specifies the respondent's location or country.

16. **Currency:** Indicates the currency used day-to-day by the respondent.

17. **CompTotal:** Provides information about the respondent's current total annual compensation.

18. **LanguageHaveWorkedWith, LanguageWantToWorkWith:** Lists programming languages the respondent has worked with and wants to work with.

19. **DatabaseHaveWorkedWith, DatabaseWantToWorkWith:** Lists databases the respondent has worked with and wants to work with.

20. **PlatformHaveWorkedWith, PlatformWantToWorkWith:** Lists platforms the respondent has worked with and wants to work with.

21. **WebframeHaveWorkedWith, WebframeWantToWorkWith:** Lists web frameworks the respondent has worked with and wants to work with.

22. **MiscTechHaveWorkedWith, MiscTechWantToWorkWith:** Lists miscellaneous technologies the respondent has worked with and wants to work with.

23. **ToolsTechHaveWorkedWith, ToolsTechWantToWorkWith:** Lists tools and technologies the respondent has worked with and wants to work with.

24. **NEWCollabToolsHaveWorkedWith, NEWCollabToolsWantToWorkWith:** Lists collaborative tools the respondent has worked with and wants to work with.

25. **OfficeStackAsyncHaveWorkedWith, OfficeStackAsyncWantToWorkWith:** Lists asynchronous office stack tools the respondent has worked with and wants to work with.

26. **OfficeStackSyncHaveWorkedWith, OfficeStackSyncWantToWorkWith:** Lists synchronous office stack tools the respondent has worked with and wants to work with.

27. **AISearchHaveWorkedWith, AISearchWantToWorkWith:** Lists AI-powered search tools the respondent has worked with and wants to work with.

28. **AIDevHaveWorkedWith, AIDevWantToWorkWith:** Lists AI-powered developer tools the respondent has worked with and wants to work with.

29. **NEWSOSites:** Lists Stack Overflow sites visited by the respondent.

30. **SOVisitFreq, SOAccount, SOPartFreq, SOComm:** Describes the respondent's engagement with Stack Overflow.

31. **ConvertedCompYearly:** Indicates the respondent's yearly compensation converted to USD.

32. **SurveyLength, SurveyEase:** Describes the respondent's perception of the survey length and ease.

33. **Knowledge_1 to Knowledge_8:** Ratings indicating the respondent's level of agreement with various statements related to knowledge and workflow.

34. **Frequency_1 to Frequency_3:** Ratings indicating the frequency of certain experiences at work.

35. **TimeSearching, TimeAnswering:** Indicate the average time spent searching for answers and answering questions at work.

36. **ProfessionalTech, Industry, SOTeamsUsage:** Provide information about the respondent's company and industry.

37. **TBranch, ICorPM, WorkExp:** Describe the respondent's willingness to participate in the Professional Developer Series, individual contributor or people manager status, and years of working experience.

38. **S7 (Final Questions), SurveyLength, SurveyEase:** Provide additional insights into the respondent's perception of the survey.
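
Many of the technology columns above (the `*HaveWorkedWith` / `*WantToWorkWith` pairs) store several answers in one semicolon-separated string, for example `HTML/CSS;JavaScript;Python`. The following is a minimal sketch of turning such a column into one row per value. It uses a tiny hand-made frame whose values only mirror the survey's format; the names `sample`, `languages`, and `language_counts` are illustrative and not part of the project.

```python
import pandas as pd

# Tiny illustrative frame; the values mimic the survey's format but are not real rows.
sample = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "LanguageHaveWorkedWith": [
        "HTML/CSS;JavaScript;Python",
        "Bash/Shell (all shells);Go",
        None,  # respondent skipped the question
    ],
})

# Split each semicolon-separated string into a list, then give every value its own row.
languages = (
    sample.assign(Language=sample["LanguageHaveWorkedWith"].str.split(";"))
          .explode("Language")
          .dropna(subset=["Language"])      # drop respondents with no answer
          [["ResponseId", "Language"]]
          .reset_index(drop=True)
)

# Count how many respondents mention each language.
language_counts = languages["Language"].value_counts()
```

The long format (one row per respondent and technology) is convenient for counting or for joining against a role column such as `DevType` later on.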


# Problem Description


- In the fast-evolving landscape of IT jobs and technologies, students often find themselves bewildered by the multitude of skills associated with different roles. Questions like "Do I need to learn C++ to be a Data Scientist?" or "Can I use JavaScript in Data Analytics?" reflect the confusion that aspiring professionals face.

- Our client is an IT educational institute that asked us to create a project aimed at providing a data-driven solution to enable students to gain insights into the relationships between different functions and their associated technologies.


# Data understanding

---


In [7]:
# Path to the raw survey CSV, relative to this notebook (forward slashes also work on Windows)
DATA_PATH = '../data/raw/survey_results_public.csv'

## Importing


In [3]:
import pandas as pd
import numpy as np

# Set the maximum number of rows and columns to display
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [8]:
df = pd.read_csv(DATA_PATH)

In [9]:
df.head()

Unnamed: 0,ResponseId,Q120,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,TechList,BuyNewTool,Country,Currency,CompTotal,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSysPersonal use,OpSysProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,AISearchHaveWorkedWith,AISearchWantToWorkWith,AIDevHaveWorkedWith,AIDevWantToWorkWith,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,SOAI,AISelect,AISent,AIAcc,AIBen,AIToolInterested in Using,AIToolCurrently Using,AIToolNot interested in Using,AINextVery different,AINextNeither different nor similar,AINextSomewhat similar,AINextVery similar,AINextSomewhat different,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Knowledge_8,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,ProfessionalTech,Industry,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I agree,None of these,18-24 years old,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,I agree,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;Friend or fam...,Formal documentation provided by the owner of ...,Other,18.0,9.0,"Senior Executive (C-Suite, VP, etc.)",2 to 9 employees,I have a great deal of influence,Investigate,Start a free trial;Ask developers I know/work ...,United States of America,USD\tUnited States dollar,285000.0,HTML/CSS;JavaScript;Python,Bash/Shell (all shells);C#;Dart;Elixir;GDScrip...,Supabase,Firebase Realtime Database;Supabase,Amazon Web Services (AWS);Netlify;Vercel,Fly.io;Netlify;Render,Next.js;React;Remix;Vue.js,Deno;Elm;Nuxt.js;React;Svelte;Vue.js,Electron;React Native;Tauri,Capacitor;Electron;Tauri;Uno Platform;Xamarin,Docker;Kubernetes;npm;Pip;Vite;Webpack;Yarn,Godot;npm;pnpm;Unity 3D;Unreal Engine;Vite;Web...,Vim;Visual Studio Code,Vim;Visual Studio Code,iOS;iPadOS;MacOS;Windows;Windows Subsystem for...,MacOS;Windows;Windows Subsystem for Linux (WSL),Asana;Basecamp;GitHub Discussions;Jira;Linear;...,GitHub Discussions;Linear;Notion;Trello,Cisco Webex Teams;Discord;Google Chat;Google M...,Discord;Signal;Slack;Zoom,ChatGPT,ChatGPT;Neeva AI,GitHub Copilot,GitHub Copilot,Stack Overflow;Stack Exchange,Daily or almost daily,Yes,A few times per month or weekly,"Yes, definitely","I don't think it's super necessary, but I thin...",Yes,Indifferent,Other (please explain),Somewhat distrust,Learning about a codebase;Writing code;Debuggi...,Writing code;Committing and reviewing code,,,,,,,Yes,People manager,10.0,Strongly agree,Agree,Strongly agree,Agree,Agree,Agree,Agree,Strongly agree,1-2 times a week,10+ times a week,Never,15-30 minutes a day,15-30 minutes a day,DevOps function;Microservices;Automated testin...,"Information Services, IT, Software Development...",Appropriate in length,Easy,285000.0
2,3,I agree,I am a developer by profession,45-54 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Professional development or self-paced l...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Formal documentation provided by the owner of ...,,27.0,23.0,"Developer, back-end","5,000 to 9,999 employees",I have some influence,Given a list,Start a free trial;Ask developers I know/work ...,United States of America,USD\tUnited States dollar,250000.0,Bash/Shell (all shells);Go,Haskell;OCaml;Rust,,,Amazon Web Services (AWS);Google Cloud;OpenSta...,,,,,,Cargo;Docker;Kubernetes;Make;Nix,Cargo;Kubernetes;Nix,Emacs;Helix,Emacs;Helix,MacOS;Other Linux-based,MacOS;Other Linux-based,Markdown File;Stack Overflow for Teams,Markdown File,Microsoft Teams;Slack;Zoom,Slack;Zoom,,,,,Stack Overflow;Stack Exchange;Stack Overflow f...,A few times per month or weekly,Yes,Less than once per month or monthly,Neutral,,"No, and I don't plan to",,,,,,,,,,,,Yes,Individual contributor,23.0,Strongly agree,Neither agree nor disagree,Agree,Agree,Agree,Agree,Agree,Agree,6-10 times a week,6-10 times a week,3-5 times a week,30-60 minutes a day,30-60 minutes a day,DevOps function;Microservices;Automated testin...,"Information Services, IT, Software Development...",Appropriate in length,Easy,250000.0
3,4,I agree,I am a developer by profession,25-34 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Colleague;Friend or family member;Other online...,Formal documentation provided by the owner of ...,,12.0,7.0,"Developer, front-end",100 to 499 employees,I have some influence,Investigate,Start a free trial;Ask developers I know/work ...,United States of America,USD\tUnited States dollar,156000.0,Bash/Shell (all shells);HTML/CSS;JavaScript;PH...,Bash/Shell (all shells);HTML/CSS;JavaScript;Ru...,PostgreSQL;Redis,PostgreSQL;Redis,Cloudflare;Heroku,Cloudflare;Heroku,Node.js;React;Ruby on Rails;Vue.js;WordPress,Node.js;Ruby on Rails;Vue.js,,,Homebrew;npm;Vite;Webpack;Yarn,Homebrew;npm;Vite,IntelliJ IDEA;Vim;Visual Studio Code;WebStorm,IntelliJ IDEA;Vim;WebStorm,iOS;iPadOS;MacOS,iOS;iPadOS;MacOS,Jira,Jira,Discord;Google Meet;Microsoft Teams;Slack;Zoom,Discord;Google Meet;Slack;Zoom,,,,,Stack Overflow;Stack Exchange,A few times per week,Yes,Less than once per month or monthly,"No, not really",I'm wearing of Stack Overflow using AI.,"No, and I don't plan to",,,,,,,,,,,,Yes,Individual contributor,7.0,Strongly agree,Strongly disagree,Strongly agree,Strongly agree,Agree,Neither agree nor disagree,Agree,Agree,1-2 times a week,10+ times a week,1-2 times a week,15-30 minutes a day,30-60 minutes a day,Automated testing;Continuous integration (CI) ...,,Appropriate in length,Easy,156000.0
4,5,I agree,I am a developer by profession,25-34 years old,"Employed, full-time;Independent contractor, fr...",Remote,Hobby;Contribute to open-source projects;Profe...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Online Courses or Certi...,Formal documentation provided by the owner of ...,Other;Codecademy;edX,6.0,4.0,"Developer, full-stack",20 to 99 employees,I have some influence,Investigate,Start a free trial;Ask developers I know/work ...,Philippines,PHP\tPhilippine peso,1320000.0,HTML/CSS;JavaScript;TypeScript,HTML/CSS;JavaScript;Python;Rust;TypeScript,BigQuery;Elasticsearch;MongoDB;PostgreSQL,Elasticsearch;MongoDB;PostgreSQL;Redis;Supabase,Amazon Web Services (AWS);Firebase;Heroku;Netl...,Amazon Web Services (AWS);Cloudflare;Digital O...,Express;Gatsby;NestJS;Next.js;Node.js;React,Express;NestJS;Next.js;Node.js;React;Remix;Vue.js,,,Docker;npm;Webpack;Yarn,Docker;npm;Yarn,Vim;Visual Studio Code,Vim;Visual Studio Code,Other (Please Specify):,Other (Please Specify):,Confluence;Jira;Notion,Confluence;Jira;Notion,Discord;Google Meet;Slack;Zoom,Discord;Google Meet;Slack;Zoom,ChatGPT,ChatGPT,,,Stack Overflow;Stack Exchange,A few times per week,No,,Neutral,Using AI to suggest better answer to my questi...,Yes,Very favorable,Increase productivity;Greater efficiency;Speed...,Somewhat trust,Project planning;Testing code;Committing and r...,Learning about a codebase;Writing code;Documen...,,,,,,,Yes,Individual contributor,6.0,Agree,Strongly agree,Agree,Agree,Neither agree nor disagree,Agree,Strongly agree,Agree,1-2 times a week,1-2 times a week,3-5 times a week,60-120 minutes a day,30-60 minutes a day,Microservices;Automated testing;Observability ...,Other,Appropriate in length,Neither easy nor difficult,23456.0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89184 entries, 0 to 89183
Data columns (total 84 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ResponseId                           89184 non-null  int64  
 1   Q120                                 89184 non-null  object 
 2   MainBranch                           89184 non-null  object 
 3   Age                                  89184 non-null  object 
 4   Employment                           87898 non-null  object 
 5   RemoteWork                           73810 non-null  object 
 6   CodingActivities                     73764 non-null  object 
 7   EdLevel                              87973 non-null  object 
 8   LearnCode                            87663 non-null  object 
 9   LearnCodeOnline                      70084 non-null  object 
 10  LearnCodeCoursesCert                 37076 non-null  object 
 11  YearsCode                   

In [18]:
df.iloc[4]

ResponseId                                                                             5
Q120                                                                             I agree
MainBranch                                                I am a developer by profession
Age                                                                      25-34 years old
Employment                             Employed, full-time;Independent contractor, fr...
RemoteWork                                                                        Remote
CodingActivities                       Hobby;Contribute to open-source projects;Profe...
EdLevel                                     Bachelor’s degree (B.A., B.S., B.Eng., etc.)
LearnCode                              Books / Physical media;Online Courses or Certi...
LearnCodeOnline                        Formal documentation provided by the owner of ...
LearnCodeCoursesCert                                                Other;Codecademy;edX
YearsCode            

In [20]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ResponseId | 89184.0 | 44592.5 | 25745.35 | 1.0 | 22296.75 | 44592.5 | 66888.25 | 89184.0 |
| CompTotal | 48225.0 | 1.036807e+42 | 2.276847e+44 | 0.0 | 63000.0 | 115000.0 | 230000.0 | 5e+46 |
| WorkExp | 43579.0 | 11.40513 | 9.051989 | 0.0 | 5.0 | 9.0 | 16.0 | 50.0 |
| ConvertedCompYearly | 48019.0 | 103110.1 | 681418.8 | 1.0 | 43907.0 | 74963.0 | 121641.0 | 74351430.0 |
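
In the summary above, `CompTotal` has a maximum of 5e+46 and a mean far above its 75th percentile, which suggests that a few extreme entries dominate the mean. One possible way to inspect this is sketched below on a small made-up series; the quantile cutoff is an arbitrary assumption chosen for the tiny sample, not a threshold used by the project.

```python
import pandas as pd

# Made-up compensation values with one extreme entry, echoing the pattern seen in CompTotal.
comp = pd.Series([63000.0, 115000.0, 230000.0, 90000.0, 5e46], name="CompTotal")

# The median is robust to a single extreme value; the mean is not.
median = comp.median()
mean = comp.mean()

# Flag values above a chosen upper quantile (0.75 only because the sample is tiny;
# on the full data a higher cutoff such as 0.99 may be more appropriate).
upper = comp.quantile(0.75)
outliers = comp[comp > upper]
trimmed = comp[comp <= upper]
```

Comparing `mean` with `median`, and `trimmed.describe()` with `comp.describe()`, shows how much a single entry can distort the summary statistics.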


## At first glance, the data contains many columns that are not useful for our case, so we should first select the features that are most relevant to our problem.


## This is done in the next notebook --> feature_selection.ipynb
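
The actual selection lives in `feature_selection.ipynb` and is not shown here. As a hedged illustration only, one way to keep just the role and technology columns is to pass a shortlist to `pd.read_csv` via `usecols`. The shortlist `SELECTED_COLUMNS` and the inline CSV text below are assumptions made for this sketch and may differ from what the next notebook does.

```python
import io
import pandas as pd

# Hypothetical shortlist: the role column plus some "have worked with" technology columns.
SELECTED_COLUMNS = [
    "ResponseId",
    "DevType",
    "LanguageHaveWorkedWith",
    "DatabaseHaveWorkedWith",
]

# Stand-in for survey_results_public.csv so the sketch runs on its own.
csv_text = io.StringIO(
    "ResponseId,Q120,DevType,LanguageHaveWorkedWith,DatabaseHaveWorkedWith,SurveyEase\n"
    '2,I agree,"Developer, back-end",Bash/Shell (all shells);Go,,Easy\n'
    '3,I agree,"Developer, front-end",HTML/CSS;JavaScript,PostgreSQL;Redis,Easy\n'
)

# usecols makes pandas parse only the listed columns and skip the rest.
subset = pd.read_csv(csv_text, usecols=SELECTED_COLUMNS)
```

With the real file, the same call would take `DATA_PATH` in place of `csv_text`.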
