# Project 1
### Problem statement
For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.

I've decided to look at the Stack Overflow annual developer survey for 2023, the results of which are available here: https://survey.stackoverflow.co/2023/

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from zipfile import ZipFile
import io
from urllib.request import urlopen
import re

Download the Stack Overflow Annual Survey 2023 results.

In [2]:
r = urlopen('https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip').read()
file = ZipFile(io.BytesIO(r))
df = pd.read_csv(file.open('survey_results_public.csv'))
df.head(2)

Unnamed: 0,ResponseId,Q120,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,ProfessionalTech,Industry,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I agree,None of these,18-24 years old,,,,,,,...,,,,,,,,,,
1,2,I agree,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;Friend or fam...,Formal documentation provided by the owner of ...,...,1-2 times a week,10+ times a week,Never,15-30 minutes a day,15-30 minutes a day,DevOps function;Microservices;Automated testin...,"Information Services, IT, Software Development...",Appropriate in length,Easy,285000.0


Let's get rid of the columns we are not interested in.

In [3]:
df = df.drop(list(df)[23:83], axis=1)
df = df.drop(['ResponseId', 'Q120', 'LearnCode', 'LearnCodeOnline', 'LearnCodeCoursesCert', 'DevType', 'OrgSize', 
              'PurchaseInfluence', 'TechList', 'BuyNewTool', 'Country', 'Currency', 'CompTotal'], axis = 1 )
df.head(2)

Unnamed: 0,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,YearsCode,YearsCodePro,LanguageHaveWorkedWith,LanguageWantToWorkWith,ConvertedCompYearly
0,None of these,18-24 years old,,,,,,,,,
1,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",18.0,9.0,HTML/CSS;JavaScript;Python,Bash/Shell (all shells);C#;Dart;Elixir;GDScrip...,285000.0
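
The positional slice `list(df)[23:83]` depends on the column order of the CSV. As a minimal sketch of an alternative, the columns could be selected by name instead. The names in `keep` are taken from the output above; the `toy` frame and the `keep` and `trimmed` variables are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Column names copied from the trimmed output above.
keep = ["MainBranch", "Age", "Employment", "RemoteWork", "CodingActivities", "EdLevel",
        "YearsCode", "YearsCodePro", "LanguageHaveWorkedWith", "LanguageWantToWorkWith",
        "ConvertedCompYearly"]

# Empty toy frame standing in for the survey data (assumption: a few extra columns exist).
toy = pd.DataFrame(columns=keep + ["ResponseId", "Q120", "Country"])

# Selecting by name raises a KeyError if a column is missing,
# which makes a changed survey schema visible straight away.
trimmed = toy[keep]
print(list(trimmed.columns))
```

Selecting by name is longer to type, but it does not silently pick the wrong columns if the file layout changes.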


I am only interested in developers likely to be in the job market, so keeping the 25 to 64 year olds will be close enough.

In [4]:
# Remove the age brackets outside 25-64, plus respondents who preferred not to say.
excluded_ages = ['18-24 years old', 'Under 18 years old', '65 years or older', 'Prefer not to say']
df = df.drop(df[df['Age'].isin(excluded_ages)].index)

I'm interested in Python when it comes to languages, so we'll create a column for those who used Python in the past year.

In [5]:

# Flag respondents who listed Python among the languages they used in the past year.
# na=False counts a missing answer as 'N'.
df["worked_with_python"] = np.where(
    df["LanguageHaveWorkedWith"].str.contains('Python', na=False), 'Y', 'N')


Let's do the same for those who want to work with Python in the future.

In [6]:
# Flag respondents who listed Python among the languages they want to work with.
# na=False counts a missing answer as 'N'.
df["want_to_python"] = np.where(
    df["LanguageWantToWorkWith"].str.contains('Python', na=False), 'Y', 'N')

Create a function to reuse this code
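
One way to act on this note is a small helper that takes a language column and returns the Y/N flags. This is only a sketch: `flag_language` is a hypothetical name, the `demo` frame is toy data in the survey's semicolon-separated format, and treating a missing answer as 'N' is an assumption.

```python
import numpy as np
import pandas as pd


def flag_language(responses: pd.Series, language: str = "Python") -> pd.Series:
    """Return 'Y' where the semicolon-separated answer lists `language`, else 'N'.

    Hypothetical helper; missing answers (NaN/None) are treated as 'N' by assumption.
    """
    # Split "HTML/CSS;JavaScript;Python" into a list and test for an exact entry,
    # so that a search for "Java" does not match "JavaScript".
    listed = responses.fillna("").str.split(";").apply(lambda langs: language in langs)
    return pd.Series(np.where(listed, "Y", "N"), index=responses.index)


# Toy data in the same format as the two survey columns.
demo = pd.DataFrame({
    "LanguageHaveWorkedWith": ["HTML/CSS;JavaScript;Python", "C#;Java", None],
    "LanguageWantToWorkWith": ["Rust", "Python;Go", "JavaScript"],
})
demo["worked_with_python"] = flag_language(demo["LanguageHaveWorkedWith"])
demo["want_to_python"] = flag_language(demo["LanguageWantToWorkWith"])
print(demo[["worked_with_python", "want_to_python"]])
```

Splitting on `;` and testing for an exact entry makes the helper usable for other languages as well, where a plain substring search could give false matches.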

Let's drop the old language columns.

In [7]:
df = df.drop(['LanguageHaveWorkedWith', 'LanguageWantToWorkWith'], axis = 1 )
df.head(2)

Unnamed: 0,MainBranch,Age,Employment,RemoteWork,CodingActivities,EdLevel,YearsCode,YearsCodePro,ConvertedCompYearly,worked_with_python,want_to_python
1,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Hobby;Contribute to open-source projects;Boots...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",18,9,285000.0,Y,N
2,I am a developer by profession,45-54 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Professional development or self-paced l...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",27,23,250000.0,N,N


Next, convert EdLevel to Irish National Framework of Qualifications (NFQ) levels, i.e. Level 7, Level 8, etc.
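
A rough sketch of how that conversion might look is given below. The keyword-to-level mapping is an assumption for illustration only (a bachelor's degree, for example, could sit at Level 7 or Level 8), and `NFQ_BY_KEYWORD` and `to_nfq_level` are hypothetical names. Only the bachelor's label is taken from the output above; the other labels are assumed.

```python
import pandas as pd

# Hypothetical keyword -> NFQ level mapping; every value here is an assumption.
NFQ_BY_KEYWORD = {
    "Secondary": 5,
    "Associate": 6,
    "Bachelor": 8,
    "Master": 9,
    "Professional": 10,
}


def to_nfq_level(ed_level):
    """Return the assumed NFQ level for an EdLevel answer, or None if nothing matches."""
    if not isinstance(ed_level, str):
        return None  # missing answer
    for keyword, level in NFQ_BY_KEYWORD.items():
        if keyword in ed_level:
            return level
    return None  # label not covered by the mapping


demo = pd.Series([
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)",  # label as shown in the output above
    "Master’s degree",                                # assumed label
    None,                                             # missing answer
])
# Nullable integer dtype keeps whole-number levels alongside missing values.
levels = demo.map(to_nfq_level).astype("Int64")
print(levels)
```

Matching on a keyword rather than the full label avoids problems with the curly apostrophes in the survey text, at the cost of being less strict about what counts as a match.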