# IT Educational Institute

## Problem Statement
Our client is an IT Educational Institute. They have reached out to us with the following:

IT jobs and technologies keep evolving quickly. This makes our field one of the most interesting out there. But on the other hand, such fast development confuses our students. They do not know which skills they need to learn for which job.

"Do I need to learn C++ to be a Data Scientist?" "Do DevOps and System admins use the same technologies?" "I really like JavaScript, can I use it in Data Analytics?" Those are some of the questions that our students ask.

Could you please develop a data-driven solution for our students to answer such questions? They mostly want to understand the relationships between the jobs and technologies.

## Data Science Workflow

![data_science_workflow](../reports/figures/data_science_workflow.png)

## 1. Business Problem
You are asking a commercial business to invest in a new project. You need to prove that your work will have a positive financial impact. __How will you prove this? What are the KPIs that you will positively impact?__
1. __Higher enrollment rate due to higher certainty__
2. __Decrease in drop-out rate__
3. __Time saved for the academic advisors__

## 2. Data
### What is your Data Source?
Our client doesn't have any internal data sources that could be used for this project. __Find the data source that you will use to build the solution__.
### Where to Start?
https://datasearch.research.google.com/

__Be Careful__:
- Be thorough with the quality checks
- Make sure that your data will be updated on a regular basis

### [Chosen Data Source: Stack Overflow Developers Survey 2022](https://insights.stackoverflow.com/survey/2022)
![stackoverflow_survey_2022](../reports/figures/stackoverflow_survey_2022.png)

### Data Description
The enclosed data set is the complete, cleaned results of the 2022 Stack Overflow Developer Survey. Free response submissions and personally-identifying information have been removed from the results to protect the privacy of respondents. There are three files besides this README:

1. survey_results_public.csv - CSV file with main survey results, one respondent per row and one column per answer
2. survey_results_schema.csv - CSV file with survey schema, i.e., the questions that correspond to each column name
3. so_survey_2022.pdf - PDF file of the survey instrument

The survey was fielded from May 11, 2022 to June 1, 2022. The median time spent on the survey for qualified responses was 15.08 minutes.

Respondents were recruited primarily through channels owned by Stack Overflow. The top 5 sources of respondents were onsite messaging, blog posts, email lists, Meta posts, banner ads, and social media posts. Since respondents were recruited in this way, highly engaged users on Stack Overflow were more likely to notice the links for the survey and click to begin it.

This database - The Public 2022 Stack Overflow Developer Survey Results - is made available under the Open Database License (ODbL): http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/

## 3. Foundations
### a. Legal and Data Privacy Check
__Global__: https://www.privacyaffairs.com/gdpr-fines/
__Local__: Look for the local regulations that apply where the business is located.

### b. How To Structure Your Project
https://drivendata.github.io/cookiecutter-data-science/
![cookiecutter](../reports/figures/cookiecutter_directory_structure.webp)

### c. Your Git Repo
https://developerhowto.com/2018/10/12/git-for-beginners/

## 4. Preprocessing
### Preprocessing at first glance
1. String values in the years columns need to be replaced with numbers
2. Multiple values separated by ";" need to be split
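
The second step can be sketched as follows. This is a minimal illustration on a toy DataFrame rather than the real survey file; `toy_df`, `split_langs`, and `long_df` are hypothetical names, and the sample values only mimic the shape of the `LanguageHaveWorkedWith` column.

```python
import pandas as pd

# Toy rows shaped like the survey's multi-select columns (illustrative values only)
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "LanguageHaveWorkedWith": ["JavaScript;TypeScript", "C#;C++;Python", None],
})

# Split each ";"-separated string into a list; missing answers stay missing
split_langs = toy_df["LanguageHaveWorkedWith"].str.split(";")

# Reshape to one row per (respondent, language) pair and drop missing answers
long_df = (
    toy_df[["ResponseId"]]
    .assign(Language=split_langs)
    .explode("Language")
    .dropna(subset=["Language"])
    .reset_index(drop=True)
)

print(long_df)
```

A long format like this is one possible choice; it makes it straightforward to count how often each technology appears, at the cost of repeating the respondent ID.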

## Exploratory Data Analysis

In [9]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
# Data and schema files paths
data_path = "../data/raw/survey_results_public.csv"
schema_path = "../data/raw/survey_results_schema.csv"

# Reading data using pandas
raw_df = pd.read_csv(data_path)
schema_df = pd.read_csv(schema_path)

In [11]:
# Printing the shape of the data
print("Shape of data is", raw_df.shape)
print("Shape of schema is", schema_df.shape)

Shape of data is (73268, 79)
Shape of schema is (79, 6)


In [12]:
# Setting the display options to None so that pandas does not truncate rows, columns, or cell contents
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Displaying first 5 observations
raw_df.head()

Unnamed: 0,ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,Country,Currency,CompTotal,CompFreq,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,OpSysProfessional use,OpSysPersonal use,VersionControlSystem,VCInteraction,VCHostingPersonal use,VCHostingProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,Blockchain,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOComm,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,None of these,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects,,,,,,,,,,,Canada,CAD\tCanadian dollar,,,JavaScript;TypeScript,Rust;TypeScript,,,,,,,,,,,,,macOS,Windows Subsystem for Linux (WSL),Git,,,,,,,,Very unfavorable,Collectives on Stack Overflow;Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies);Stack Overflow;Stack Exchange,Daily or almost daily,Yes,Daily or almost daily,Not sure,,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Difficult,
2,3,"I am not primarily a developer, but I write code sometimes as part of my work","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Books / Physical media;Friend or family member;Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc)",Technical documentation;Blogs;Programming Games;Written Tutorials;Stack Overflow,,14.0,5.0,"Data scientist or machine learning specialist;Developer, front-end;Engineer, data;Engineer, site reliability",20 to 99 employees,I have some influence,,United Kingdom of Great Britain and Northern Ireland,GBP\tPound sterling,32000.0,Yearly,C#;C++;HTML/CSS;JavaScript;Python,C#;C++;HTML/CSS;JavaScript;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,Angular.js,Angular;Angular.js,Pandas,.NET,,,Notepad++;Visual Studio,Notepad++;Visual Studio,Windows,Windows,Git,Code editor,,,,,Microsoft Teams,Microsoft Teams,Very unfavorable,Collectives on Stack Overflow;Stack Overflow;Stack Exchange,Multiple times per day,Yes,Multiple times per day,Neutral,25-34 years old,Man,No,Bisexual,White,None of the above,"I have a mood or emotional disorder (e.g., depression, bipolar disorder, etc.);I have an anxiety disorder",No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0
3,4,I am a developer by profession,"Employed, full-time",Fully remote,I don’t code outside of work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Books / Physical media;School (i.e., University, College, etc)",,,20.0,17.0,"Developer, full-stack",100 to 499 employees,I have some influence,Other (please specify):,Israel,ILS\tIsraeli new shekel,60000.0,Monthly,C#;JavaScript;SQL;TypeScript,C#;SQL;TypeScript,Microsoft SQL Server,Microsoft SQL Server,,,ASP.NET;ASP.NET Core,ASP.NET;ASP.NET Core,.NET,.NET,,,Notepad++;Visual Studio;Visual Studio Code,Notepad++;Visual Studio;Visual Studio Code,Windows,Windows,Git,Code editor;Command-line;Version control hosting service web GUI;Dedicated version control GUI application,,,Jira Work Management;Trello,Jira Work Management;Trello,Slack;Zoom,Slack;Zoom,Very unfavorable,Collectives on Stack Overflow;Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies);Stack Overflow;Stack Exchange,Daily or almost daily,Yes,A few times per week,"Yes, definitely",35-44 years old,Man,No,Straight / Heterosexual,White,None of the above,None of the above,No,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,215232.0
4,5,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, forum);School (i.e., University, College, etc);On the job training","Technical documentation;Blogs;Stack Overflow;Online books;Video-based Online Courses;Online challenges (e.g., daily or weekly coding challenges)",,8.0,3.0,"Developer, front-end;Developer, full-stack;Developer, back-end;Developer, desktop or enterprise applications;Developer, QA or test",20 to 99 employees,I have some influence,Start a free trial;Visit developer communities like Stack Overflow,United States of America,USD\tUnited States dollar,,,C#;HTML/CSS;JavaScript;SQL;Swift;TypeScript,C#;Elixir;F#;Go;JavaScript;Rust;TypeScript,Cloud Firestore;Elasticsearch;Microsoft SQL Server;Firebase Realtime Database,Cloud Firestore;Elasticsearch;Firebase Realtime Database;Redis,Firebase;Microsoft Azure,Firebase;Microsoft Azure,Angular;ASP.NET;ASP.NET Core ;jQuery;Node.js,Angular;ASP.NET Core ;Blazor;Node.js,.NET,.NET;Apache Kafka,npm,Docker;Kubernetes,Notepad++;Visual Studio;Visual Studio Code;Xcode,Rider;Visual Studio;Visual Studio Code,Windows,macOS;Windows,Git;Other (please specify):,Code editor,,,,,Microsoft Teams;Zoom,,Unfavorable,Collectives on Stack Overflow;Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies);Stack Overflow;Stack Exchange,Multiple times per day,Yes,Daily or almost daily,"Yes, definitely",25-34 years old,,,,,,,No,,,,,,,,,,,,,,,,,,,,Too long,Easy,


In [13]:
# Taking a look at the schema to understand the features better
schema_df

Unnamed: 0,qid,qname,question,force_resp,type,selector
0,QID16,S0,"<div><span style=""font-size:19px;""><strong>Hello world! </strong></span></div>\n\n<div> </div>\n\n<div>Thank you for taking the 2022 Stack Overflow Developer Survey, the longest running survey of software developers (and anyone else who codes!) on Earth. </div>\n\n<div> </div>\n\n<div>As in previous years, anonymized results of the survey will be made publicly available under the Open Database License, where anyone can download and analyze the data. On that note, throughout the survey, certain answers you and your peers give will be treated as personally identifiable information, and therefore kept out of the anonymized results file. We'll call out each of those in the survey with a note saying ""This information will be kept private."" </div>\n\n<div> </div>\n\n<div>There are seven sections in this survey. The 2nd, 3rd, and 4th sections will appear in a random order.</div><div><br></div>\n\n<div> 1. Basic Information</div>\n\n<div> 2. Education, Work, and Career</div>\n\n<div> 3. Technology and Tech Culture</div>\n\n<div> 4. Stack Overflow Usage + Community</div>\n\n<div> 5. Demographic Information </div>\n\n<div> 6. Professional Developer Series (Optional)</div><div> 7. Final Questions</div>\n\n<div> \n<div>Most questions in this survey are optional. Required questions are marked with *. This anonymous survey will take about 10 minutes to complete. We encourage you to complete it in one sitting.</div><div><br></div>\n</div>\n\n<div><strong>If you use security or ad-blocking plugins, you may see error messages</strong></div>\n\n<div>Our third-party software provider, Qualtrics, does not work well with certain ad blockers and security software. To avoid error messages that prevent you from taking the survey, please try specifically unblocking Qualtrics in your plugin or pausing the plugin while you take the survey. </div>\n\n<div> </div>\n\n<div>To begin, click <strong>Next.</strong></div>",False,DB,TB
1,QID12,MetaInfo,Browser Meta Info,False,Meta,Browser
2,QID1,S1,"<span style=""font-size:22px; font-family: arial,helvetica,sans-serif; font-weight: 700;"">Basic Information</span><br>\n<br>\n<p><span style=""font-size:16px; font-family:arial,helvetica,sans-serif;"">The first section will focus on some basic information about who you are.<br>\n<br>\nMost questions in this section are required. Required questions are noted with *.</span></p>",False,DB,TB
3,QID2,MainBranch,"Which of the following options best describes you today? Here, by ""developer"" we mean ""someone who writes code."" <b>*</b>",True,MC,SAVR
4,QID296,Employment,Which of the following best describes your current employment status?,False,MC,MAVR
5,QID308,RemoteWork,Which best describes your current work situation?,False,MC,SAVR
6,QID297,CodingActivities,Which of the following best describes the code you write outside of work? Select all that apply.,False,MC,MAVR
7,QID190,S2,"<span style=""font-size:22px; font-family: arial,helvetica,sans-serif; font-weight: 700;"">Education, work, and career</span><br />\n \n<p><span style=""font-size:16px; font-family:arial,helvetica,sans-serif;"">This section will focus on your education, work, and career.<br />\n<br />\nMost questions in this section are optional. Required questions are noted with *.</span></p>",False,DB,TB
8,QID25,EdLevel,Which of the following best describes the highest level of formal education that you’ve completed? *,False,MC,SAVR
9,QID276,LearnCode,How did you learn to code? Select all that apply.,False,MC,MAVR
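
Since the schema maps each `qname` to its question text, a small lookup can make it quicker to check what a data column means. The sketch below assumes only the `qname` and `question` columns shown above; `toy_schema` and `question_by_column` are hypothetical names, and the two rows are copied from the schema output for illustration.

```python
import pandas as pd

# Two schema rows copied from the output above, for illustration only
toy_schema = pd.DataFrame({
    "qname": ["RemoteWork", "LearnCode"],
    "question": [
        "Which best describes your current work situation?",
        "How did you learn to code? Select all that apply.",
    ],
})

# Index the question text by column name for quick lookups
question_by_column = toy_schema.set_index("qname")["question"]

print(question_by_column["LearnCode"])
```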


In [14]:
# General information about the data
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73268 entries, 0 to 73267
Data columns (total 79 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ResponseId                      73268 non-null  int64  
 1   MainBranch                      73268 non-null  object 
 2   Employment                      71709 non-null  object 
 3   RemoteWork                      58958 non-null  object 
 4   CodingActivities                58899 non-null  object 
 5   EdLevel                         71571 non-null  object 
 6   LearnCode                       71580 non-null  object 
 7   LearnCodeOnline                 50685 non-null  object 
 8   LearnCodeCoursesCert            29389 non-null  object 
 9   YearsCode                       71331 non-null  object 
 10  YearsCodePro                    51833 non-null  object 
 11  DevType                         61302 non-null  object 
 12  OrgSize                         

From the information above, I see that the data has 5 float variables, 1 int variable, and 73 object variables. Many columns contain null values, and "VCHostingPersonal use" and "VCHostingProfessional use" have 0 non-null entries. Also, some columns that are supposed to be numerical, such as "YearsCode" and "YearsCodePro", are stored as object data types. I'd like to investigate further.
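
These observations can also be checked programmatically instead of reading the `info()` output by eye. The following is a minimal sketch on a toy DataFrame; `toy_df`, `dtype_counts`, `is_all_null`, and `all_null_cols` are hypothetical names, and the values are made up to mirror the situation described above.

```python
import numpy as np
import pandas as pd

# Toy data: one int column, one text column, and one column with no answers at all
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "YearsCode": ["14", "Less than 1 year", None],
    "VCHostingPersonal use": [np.nan, np.nan, np.nan],
})

# Number of columns per data type
dtype_counts = toy_df.dtypes.value_counts()

# Columns in which every value is missing
is_all_null = toy_df.isna().all()
all_null_cols = is_all_null[is_all_null].index.tolist()

print(dtype_counts)
print(all_null_cols)
```

The same two expressions could be applied to `raw_df` to confirm the counts and the empty columns.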

In [15]:
# Displaying a statistical summary for numerical variables
raw_df.describe()

Unnamed: 0,ResponseId,CompTotal,VCHostingPersonal use,VCHostingProfessional use,WorkExp,ConvertedCompYearly
count,73268.0,38422.0,0.0,0.0,36769.0,38071.0
mean,36634.5,2.342434e+52,,,10.242378,170761.3
std,21150.794099,4.591478e+54,,,8.70685,781413.2
min,1.0,0.0,,,0.0,1.0
25%,18317.75,30000.0,,,4.0,35832.0
50%,36634.5,77500.0,,,8.0,67845.0
75%,54951.25,154000.0,,,15.0,120000.0
max,73268.0,9e+56,,,50.0,50000000.0


In [16]:
# Investigating variables "YearsCode" and "YearsCodePro"
years_cols = ['YearsCode','YearsCodePro']

for col in years_cols:
    print(col)
    print()
    print(raw_df[col].unique().tolist())
    print('-'*100)
    print()

YearsCode

[nan, '14', '20', '8', '15', '3', '1', '6', '37', '5', '12', '22', '11', '4', '7', '13', '36', '2', '25', '10', '40', '16', '27', '24', '19', '9', '17', '18', '26', 'More than 50 years', '29', '30', '32', 'Less than 1 year', '48', '45', '38', '39', '28', '23', '43', '21', '41', '35', '50', '33', '31', '34', '46', '44', '42', '47', '49']
----------------------------------------------------------------------------------------------------

YearsCodePro

[nan, '5', '17', '3', '6', '30', '2', '10', '15', '4', '22', '20', '40', '9', '14', '21', '7', '18', '25', '8', '12', '45', '1', '19', '28', '24', '11', '23', 'Less than 1 year', '32', '27', '16', '44', '26', '37', '46', '13', '31', '39', '34', '38', '35', '29', '42', '36', '33', '43', '41', '48', '50', 'More than 50 years', '47', '49']
----------------------------------------------------------------------------------------------------



Both columns (YearsCode, YearsCodePro) contain two text values, 'Less than 1 year' and 'More than 50 years', which is why they are stored as object data type instead of float.
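
One way to handle this is to map the two text answers to numbers before casting the column. The sketch below works on a toy Series; the stand-in values (0 for 'Less than 1 year' and 51 for 'More than 50 years') are an assumption made for illustration, not a choice confirmed by the data source, and `years`, `replacements`, and `years_numeric` are hypothetical names.

```python
import pandas as pd

# Toy values shaped like the YearsCode column (illustrative only)
years = pd.Series(
    ["14", "Less than 1 year", None, "More than 50 years", "8"],
    name="YearsCode",
)

# Assumed numeric stand-ins for the two text answers
replacements = {"Less than 1 year": "0", "More than 50 years": "51"}

# Replace the text answers, then convert everything to numbers (missing stays NaN)
years_numeric = pd.to_numeric(years.replace(replacements))

print(years_numeric)
```

Other stand-ins (for example 0.5 and 50) would work equally well; what matters is that the choice is documented and applied to both columns consistently.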