# Title here

Description here

## Table of contents
- [1. Business understanding](#business)
- [2. Data understanding](#data)
    - [2.1. Gathering data](#gather)
    - [2.2. Assessing data](#assess)
- [3. Prepare data](#prepare)
- [4. Data modeling](#model)
- [5. Evaluate the results](#eval)
- [6. Deploy](#deploy)

<a name="business"></a>
## 1. Business understanding

Text text

> Question 1 \
> Question 2 \
> Question 3 \
> Question 4

<a name="data"></a>
## 2. Data understanding

Text text

<a name="gather"></a>
### 2.1. Gathering data

First, we need to download all the necessary data. The cell below downloads the Stack Overflow surveys for all years, along with the shapefiles we will use for mapping:

In [None]:
# Download survey data
%run -i '../download/download.py'

# Download shape files
%run -i '../download/shape.py'

These are all the surveys since 2011. We will only use the ones from the last five years (2016 to 2020). One reason is that the structure of the survey changed over time, so similar questions from earlier years might not be comparable anymore. Next, in preparation for the following sections, we can import the relevant libraries.

In [1]:
# Import libraries
import country_converter as coco
import geopandas as gpd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
%matplotlib inline

<a name="assess"></a>
### 2.2. Assessing data

Now that we have downloaded all the datasets, let's start by reading the CSV files from the past five years. In doing so, I am dropping the first column, as it only serves as an ordered identifier for the respondents.

In [2]:
# Import survey data and skip first column
import warnings; warnings.simplefilter('ignore')
survey_2016 = pd.read_csv("../data/survey/survey_2016.csv").iloc[:, 1:]
survey_2017 = pd.read_csv("../data/survey/survey_2017.csv").iloc[:, 1:]
survey_2018 = pd.read_csv("../data/survey/survey_2018.csv").iloc[:, 1:]
survey_2019 = pd.read_csv("../data/survey/survey_2019.csv").iloc[:, 1:]
survey_2020 = pd.read_csv("../data/survey/survey_2020.csv").iloc[:, 1:]

# Import shapefile with geopandas
map_df = gpd.read_file("../data/shapefile/world_countries_2017.shp")
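
The five near-identical `read_csv` calls could also be written as a loop over the years. The sketch below is only an illustration: it reads from small in-memory strings (hypothetical stand-ins for the real files under `../data/survey/`) so that it runs without the downloaded data, and `load_survey` is a helper name introduced here, not part of this notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the CSV files, so the sketch runs
# without the downloaded data; the real files live under ../data/survey/.
fake_files = {
    2019: "Respondent,Country,JobSat\n1,Germany,Very satisfied\n2,Brazil,Neutral\n",
    2020: "Respondent,Country,JobSat\n1,Ukraine,Slightly satisfied\n",
}


def load_survey(source):
    """Read one survey CSV and drop its first (identifier) column."""
    return pd.read_csv(source).iloc[:, 1:]


# One dataframe per year, keyed by the survey year
surveys = {year: load_survey(io.StringIO(text)) for year, text in fake_files.items()}

for year, df in surveys.items():
    print(year, df.shape, list(df.columns))
```

With the real data, `source` would be a path such as `"../data/survey/survey_2020.csv"`. Keeping the frames in a dictionary makes it easy to apply the same step to every year later on.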

Great! Now we can quickly look at what these datasets look like. I will do that by picking two random samples from the survey.

In [3]:
# Show dataframe for two random samples for 2020
pd.options.display.max_columns = None # to show all columns
survey_2020.sample(2)

Unnamed: 0,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,CurrencySymbol,DatabaseDesireNextYear,DatabaseWorkedWith,DevType,EdLevel,Employment,Ethnicity,Gender,JobFactors,JobSat,JobSeek,LanguageDesireNextYear,LanguageWorkedWith,MiscTechDesireNextYear,MiscTechWorkedWith,NEWCollabToolsDesireNextYear,NEWCollabToolsWorkedWith,NEWDevOps,NEWDevOpsImpt,NEWEdImpt,NEWJobHunt,NEWJobHuntResearch,NEWLearn,NEWOffTopic,NEWOnboardGood,NEWOtherComms,NEWOvertime,NEWPurchaseResearch,NEWPurpleLink,NEWSOSites,NEWStuck,OpSys,OrgSize,PlatformDesireNextYear,PlatformWorkedWith,PurchaseWhat,Sexuality,SOAccount,SOComm,SOPartFreq,SOVisitFreq,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
59644,"I am not primarily a developer, but I write co...",No,27.0,19,Yearly,132000.0,132000.0,United States,United States dollar,USD,MySQL,MySQL,"Developer, back-end;Engineer, site reliability...","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,White or of European descent,,Industry that I’d be working in;Diversity of t...,Slightly satisfied,I am not interested in new job opportunities,Go,Go,Teraform,Pandas;Teraform,,Confluence;Jira;Github;Slack;Google Suite (Doc...,Yes,Extremely important,Not at all important/not necessary,Having a bad day (or week or month) at work;Cu...,"Read company media, such as employee blogs or ...",Once a year,No,No,Yes,Often: 1-2 days per week or more,Start a free trial;Ask developers I know/work ...,Annoyed,Stack Overflow (public Q&A for anyone who codes),Visit Stack Overflow;Do other work and come ba...,MacOS,100 to 499 employees,Google Cloud Platform;Kubernetes;Raspberry Pi,Docker;Google Cloud Platform;Kubernetes;MacOS,I have a great deal of influence,Bisexual,No,"No, not at all",,A few times per month or weekly,Easy,Appropriate in length,No,"Another engineering discipline (such as civil,...",,,Just as welcome now as I felt last year,50.0,4,3.0
59302,I am a student who is learning to code,Yes,47.0,41,,,,Ukraine,,,MySQL;Oracle;SQLite,MySQL,"Developer, front-end;Developer, full-stack","Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Not employed, but looking for work",White or of European descent,Man,Flex time or a flexible schedule;Office enviro...,,"I’m not actively looking, but I am open to new...",Go;JavaScript;Python;SQL;TypeScript,JavaScript;Python,Node.js;React Native;Torch/PyTorch,Node.js;React Native,Github;Gitlab;Slack;Microsoft Teams;Stack Over...,Github;Gitlab,,,,,,Once a year,,,Yes,,Start a free trial;Ask developers I know/work ...,Amused,I have never visited any of these sites,Visit Stack Overflow;Go for a walk or other ph...,MacOS,,Docker;iOS;Linux;MacOS;Windows,MacOS,,Bisexual;Straight / Heterosexual,,,,,Easy,Appropriate in length,No,"Another engineering discipline (such as civil,...",Angular.js;Django;jQuery;React.js;Vue.js,React.js,,,3,


And for the remaining years we see:

In [4]:
# Random sample for 2019
survey_2019.sample(2)

Unnamed: 0,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
56377,"I used to be a developer by profession, but no...",Yes,Never,The quality of OSS and closed source software ...,Retired,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A humanities discipline (ex. literature, histo...",Taken an online course in programming or softw...,,,7,50,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,C#;SQL;Other(s):,C#;SQL;Other(s):,Microsoft SQL Server;MySQL;SQLite,Microsoft SQL Server;MySQL;SQLite,Windows,Windows,,ASP.NET,.NET,.NET;.NET Core,Visual Studio,Windows,I do not use containers,,,Yes,"Fortunately, someone else has that title",Yes,Facebook,In real life (in person),Username,2016,Daily or almost daily,Find answers to specific questions;Learn how t...,1-2 times per week,Stack Overflow was much faster,31-60 minutes,Yes,Multiple times per day,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","Yes, definitely",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,71.0,Woman,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
49061,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,"Independent contractor, freelancer, or self-em...",United States,No,"Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,"Just me - I am a freelancer, sole proprietor, ...","Designer;Developer, back-end;Developer, front-...",3,15,Less than 1 year,Slightly satisfied,Very satisfied,,,,I am not interested in new job opportunities,Less than a year ago,Write any code;Interview with people in peer r...,Yes,Specific department or team I'd be working on;...,I was preparing for a job search,USD,United States dollar,,Monthly,,,There is a schedule and/or spec (made by me or...,Being tasked with non-development work;Inadequ...,It's complicated,Home,Average,"Yes, because I see value in code review",3.0,,,,HTML/CSS;JavaScript;Python;SQL,C;Go;HTML/CSS;Java;JavaScript;Python;SQL;TypeS...,DynamoDB;MariaDB;MongoDB;MySQL;PostgreSQL;Redi...,DynamoDB;MySQL;PostgreSQL;Redis,Arduino;AWS;Docker;Heroku;Linux;Microsoft Azur...,Arduino;AWS;Docker;Linux;Raspberry Pi;Windows,Django;Flask;React.js;Vue.js,Django;Flask;React.js;Vue.js,Chef;Pandas;TensorFlow,Pandas;TensorFlow,Sublime Text;Visual Studio Code,Windows,Development;Testing;Production,,Useful for immutable record keeping outside of...,Yes,SIGH,Yes,Twitter,In real life (in person),UserID,2015,Daily or almost daily,Find answers to specific questions;Learn how t...,3-5 times per week,Stack Overflow was slightly faster,11-30 minutes,Yes,I have never participated in Q&A on Stack Over...,Yes,"No, I've heard of them, but I am not part of a...",Neutral,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Easy


In [5]:
# Random sample for 2018
survey_2018.sample(2)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,AssessJob1,AssessJob2,AssessJob3,AssessJob4,AssessJob5,AssessJob6,AssessJob7,AssessJob8,AssessJob9,AssessJob10,AssessBenefits1,AssessBenefits2,AssessBenefits3,AssessBenefits4,AssessBenefits5,AssessBenefits6,AssessBenefits7,AssessBenefits8,AssessBenefits9,AssessBenefits10,AssessBenefits11,JobContactPriorities1,JobContactPriorities2,JobContactPriorities3,JobContactPriorities4,JobContactPriorities5,JobEmailPriorities1,JobEmailPriorities2,JobEmailPriorities3,JobEmailPriorities4,JobEmailPriorities5,JobEmailPriorities6,JobEmailPriorities7,UpdateCV,Currency,Salary,SalaryType,ConvertedSalary,CurrencySymbol,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AdsPriorities1,AdsPriorities2,AdsPriorities3,AdsPriorities4,AdsPriorities5,AdsPriorities6,AdsPriorities7,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
35065,Yes,Yes,Germany,"Yes, full-time",Employed full-time,"Other doctoral degree (Ph.D, Ed.D., etc.)","Computer science, computer engineering, or sof...","5,000 to 9,999 employees",Educator or academic researcher;Engineering ma...,18-20 years,12-14 years,Moderately satisfied,Moderately satisfied,Working as an engineering manager or other fun...,"I’m not actively looking, but I am open to new...",Less than a year ago,5.0,7.0,1.0,9.0,2.0,3.0,8.0,4.0,10.0,6.0,1.0,5.0,2.0,3.0,11.0,9.0,10.0,7.0,8.0,4.0,6.0,5.0,1.0,3.0,4.0,2.0,2.0,3.0,1.0,4.0,5.0,6.0,7.0,My job status or other personal status changed,Euros (€),56000,Yearly,68537.0,EUR,"Other chat system (IRC, proprietary software, ...",One to three months,"Taught yourself a new language, framework, or ...",The official documentation and/or standards fo...,,,Strongly agree,Strongly disagree,Disagree,Assembly;C;C++;Java;Python;Scala;SQL;Bash/Shell,Rust,SQLite,PostgreSQL,Linux,Android,Django,TensorFlow,IntelliJ;IPython / Jupyter;Vim,Linux-based,2,,Git;Copying and pasting files to network shares,Multiple times per day,Yes,Yes,The ad-blocking software was causing display i...,Somewhat disagree,Somewhat disagree,Somewhat agree,Saw an online advertisement and then researche...,1.0,4.0,5.0,7.0,2.0,6.0,3.0,Algorithms making important decisions,Increasing automation of jobs,A governmental or other regulatory body,I'm excited about the possibilities more than ...,Depends on what it is,Depends on what it is,Upper management at the company/organization,Yes,10 (Very Likely),A few times per month or weekly,Yes,A few times per month or weekly,"No, I knew that Stack Overflow had a jobs boar...","No, I know what it is but I don't have one",,No,A little bit interested,Not at all interested,Not at all interested,A little bit interested,A little bit interested,Between 7:01 - 8:00 AM,9 - 12 hours,1 - 2 hours,Never,,1 - 2 times per week,Male,Straight or heterosexual,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",White or of European descent,25 
- 34 years old,Yes,,The survey was too long,Somewhat easy
56522,Yes,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele...",10 to 19 employees,Desktop or enterprise applications developer;M...,12-14 years,9-11 years,Extremely dissatisfied,Moderately dissatisfied,Working as an engineering manager or other fun...,"I’m not actively looking, but I am open to new...",Between 1 and 2 years ago,10.0,4.0,7.0,8.0,2.0,5.0,6.0,1.0,9.0,3.0,1.0,5.0,2.0,7.0,9.0,3.0,8.0,6.0,11.0,10.0,4.0,4.0,1.0,5.0,3.0,2.0,3.0,7.0,1.0,4.0,2.0,5.0,6.0,I had a negative experience or interaction at ...,U.S. dollars ($),121000,Monthly,1452000.0,USD,"Office / productivity suite (Microsoft Office,...",Six to nine months,"Taught yourself a new language, framework, or ...","A book or e-book from O’Reilly, Apress, or a s...",,,Neither Agree nor Disagree,Agree,Strongly disagree,C++;C#;Matlab,C;C++;C#;F#;Matlab,,MongoDB;MySQL;PostgreSQL;SQLite,Android;Windows Desktop or Server,Android;Azure;Windows Desktop or Server,Xamarin,.NET Core;Node.js;Xamarin,Notepad++;Visual Studio,Windows,3,Agile;Mob programming,Subversion,Multiple times per day,No,,,Strongly agree,Somewhat agree,Neither agree nor disagree,Clicked on an online advertisement;Saw an onli...,2.0,4.0,3.0,1.0,6.0,5.0,7.0,Increasing automation of jobs,Algorithms making important decisions,The developers or the people creating the AI,I'm excited about the possibilities more than ...,Depends on what it is,"Yes, but only within the company",Upper management at the company/organization,Yes,8,Daily or almost daily,Yes,Less than once per month or monthly,Yes,"No, I have one but it's out of date",9.0,I'm not sure,Somewhat interested,Very interested,Very interested,Extremely interested,Extremely interested,Between 7:01 - 8:00 AM,5 - 8 hours,Less than 30 minutes,1 - 2 times per week,,I don't typically exercise,Male,Straight or heterosexual,Associate degree,White or of European descent,35 - 44 years old,No,No,The survey was too long,Somewhat 
easy


In [6]:
# Random sample for 2017
survey_2017.sample(2)

Unnamed: 0,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,CompanyType,YearsProgram,YearsCodedJob,YearsCodedJobPast,DeveloperType,WebDeveloperType,MobileDeveloperType,NonDeveloperType,CareerSatisfaction,JobSatisfaction,ExCoderReturn,ExCoderNotForMe,ExCoderBalance,ExCoder10Years,ExCoderBelonged,ExCoderSkills,ExCoderWillNotCode,ExCoderActive,PronounceGIF,ProblemSolving,BuildingThings,LearningNewTech,BoringDetails,JobSecurity,DiversityImportant,AnnoyingUI,FriendsDevelopers,RightWrongWay,UnderstandComputers,SeriousWork,InvestTimeTools,WorkPayCare,KinshipDevelopers,ChallengeMyself,CompetePeers,ChangeWorld,JobSeekingStatus,HoursPerWeek,LastNewJob,AssessJobIndustry,AssessJobRole,AssessJobExp,AssessJobDept,AssessJobTech,AssessJobProjects,AssessJobCompensation,AssessJobOffice,AssessJobCommute,AssessJobRemote,AssessJobLeaders,AssessJobProfDevel,AssessJobDiversity,AssessJobProduct,AssessJobFinances,ImportantBenefits,ClickyKeys,JobProfile,ResumePrompted,LearnedHiring,ImportantHiringAlgorithms,ImportantHiringTechExp,ImportantHiringCommunication,ImportantHiringOpenSource,ImportantHiringPMExp,ImportantHiringCompanies,ImportantHiringTitles,ImportantHiringEducation,ImportantHiringRep,ImportantHiringGettingThingsDone,Currency,Overpaid,TabsSpaces,EducationImportant,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,CousinEducation,WorkStart,HaveWorkedLanguage,WantWorkLanguage,HaveWorkedFramework,WantWorkFramework,HaveWorkedDatabase,WantWorkDatabase,HaveWorkedPlatform,WantWorkPlatform,IDE,AuditoryEnvironment,Methodology,VersionControl,CheckInCode,ShipIt,OtherPeoplesCode,ProjectManagement,EnjoyDebugging,InTheZone,DifficultCommunication,CollaborateRemote,MetricAssess,EquipmentSatisfiedMonitors,EquipmentSatisfiedCPU,EquipmentSatisfiedRAM,EquipmentSatisfiedStorage,EquipmentSatisfiedRW,InfluenceInternet,InfluenceWorkstation,InfluenceHardware,InfluenceServers,InfluenceTechStack,InfluenceDeptTech,InfluenceVizTools,InfluenceData
base,InfluenceCloud,InfluenceConsultants,InfluenceRecruitment,InfluenceCommunication,StackOverflowDescribes,StackOverflowSatisfaction,StackOverflowDevices,StackOverflowFoundAnswer,StackOverflowCopiedCode,StackOverflowJobListing,StackOverflowCompanyPage,StackOverflowJobSearch,StackOverflowNewQuestion,StackOverflowAnswer,StackOverflowMetaChat,StackOverflowAdsRelevant,StackOverflowAdsDistracting,StackOverflowModeration,StackOverflowCommunity,StackOverflowHelpful,StackOverflowBetter,StackOverflowWhatDo,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
13337,Student,"Yes, I program as a hobby",United States,No,"Not employed, and not looking for work",I never completed any formal education,,,,,1 to 2 years,,,,,,,,,,,,,,,,,"With a soft ""g,"" like ""jiff""",Agree,Strongly agree,Strongly agree,Agree,Agree,Strongly agree,Strongly agree,Disagree,Disagree,Somewhat agree,,Agree,Disagree,Strongly agree,Agree,Disagree,Agree,"I'm not actively looking, but I am open to new...",0.0,Not applicable/ never,Important,Important,Very important,Important,Important,Important,Important,Very important,Very important,Very important,Important,Important,Very important,Important,Important,Annual bonus; Equipment; Private office; Expec...,No,,,,,,,,,,,,,,U.S. dollars ($),,Tabs,,Online course; Self-taught,Official documentation; Stack Overflow Q&A; St...,,None of these,10:00 AM,,JavaScript; PHP; SQL,,,,,WordPress; Amazon Web Services (AWS),iOS; Mac OS; Raspberry Pi; WordPress; Amazon W...,Xcode,Turn on some music,,I don't use version control,,,,,,,,,Bugs found; Hours worked; Commit frequency; Cu...,Somewhat satisfied,Very satisfied,Very satisfied,Very satisfied,Very satisfied,Very satisfied,,,,,,,,,,,,"I have a login for Stack Overflow, but haven't...",6.0,Desktop; iOS browser,Once or twice,At least once each week,Haven't done at all,Haven't done at all,Haven't done at all,Several times,Haven't done at all,Several times,,Disagree,Agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Male,A master's degree,White or of European descent,Disagree,Somewhat agree,Disagree,Strongly agree,,100000.0
6336,Professional developer,"Yes, I program as a hobby",United States,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",Publicly-traded corporation,6 to 7 years,2 to 3 years,,Web developer,Full stack Web developer,,,8.0,8.0,,,,,,,,,"With a soft ""g,"" like ""jiff""",Agree,Agree,Agree,Agree,Agree,Somewhat agree,Agree,Somewhat agree,Somewhat agree,Somewhat agree,Agree,Somewhat agree,Somewhat agree,Agree,Agree,Somewhat agree,Agree,"I'm not actively looking, but I am open to new...",0.0,Between 2 and 4 years ago,Important,Important,Somewhat important,Very important,Very important,Important,Very important,Important,Important,Important,Important,Very important,Somewhat important,Important,Important,Annual bonus; Vacation/days off; Health benefi...,No,,I saw an employer's advertisement,A career fair or on-campus recruiting event,Somewhat important,Important,Important,Not at all important,Somewhat important,Important,Important,Somewhat important,Not very important,Important,,,Both,,,,,,9:00 AM,C#; Java; JavaScript; SQL,C#; Java; JavaScript; SQL; Swift; TypeScript,AngularJS,AngularJS; Xamarin; .NET Core,SQL Server; Oracle,SQL Server; Oracle,,,Sublime Text; Eclipse; Visual Studio,Turn on some music,Agile,Subversion,A few times a week,Somewhat agree,Strongly agree,Somewhat agree,Strongly agree,Agree,Agree,Agree,Bugs found; Customer satisfaction; On time/in ...,Somewhat satisfied,Satisfied,Very satisfied,Very satisfied,Very satisfied,Satisfied,Some influence,Some influence,Not much influence,Not much influence,Not much influence,No influence at all,No influence at all,Some influence,No influence at all,No influence at all,Some influence,"I have a login for Stack Overflow, but haven't...",10.0,Desktop,At least once each week,Haven't done at all,Several times,Several times,Several times,Haven't done at all,Haven't done at all,Haven't done at all,Somewhat agree,Somewhat 
agree,Disagree,Somewhat agree,Strongly agree,Strongly agree,Strongly agree,Disagree,Male,A master's degree,White or of European descent,Somewhat agree,Agree,Strongly disagree,Strongly agree,,


In [7]:
# Random sample for 2016
survey_2016.sample(2)

Unnamed: 0,collector,country,un_subregion,so_region,age_range,age_midpoint,gender,self_identification,occupation,occupation_group,experience_range,experience_midpoint,salary_range,salary_midpoint,big_mac_index,tech_do,tech_want,aliens,programming_ability,employment_status,industry,company_size_range,team_size_range,women_on_team,remote,job_satisfaction,job_discovery,dev_environment,commit_frequency,hobby,dogs_vs_cats,desktop_os,unit_testing,rep_range,visit_frequency,why_learn_new_tech,education,open_to_new_job,new_job_value,job_search_annoyance,interview_likelihood,how_to_improve_interview_process,star_wars_vs_star_trek,agree_tech,agree_notice,agree_problemsolving,agree_diversity,agree_adblocker,agree_alcohol,agree_loveboss,agree_nightcode,agree_legacy,agree_mars,important_variety,important_control,important_sameend,important_newtech,important_buildnew,important_buildexisting,important_promotion,important_companymission,important_wfh,important_ownoffice,developer_challenges,why_stack_overflow
7570,Facebook,Sweden,Northern Europe,Western Europe,30-34,32.0,Male,Developer; Engineer; Programmer; Sr. Developer...,Desktop developer,Desktop developer,6 - 10 years,8.0,"$70,000 - $80,000",75000.0,5.23,C++; C#; SQL,C#; SQL,Yes,8.0,Employed full-time,Manufacturing,20-99 employees,1-4 people,0,Part-time remote,I'm somewhat satisfied with my job,A friend referred me,Notepad++; Visual Studio,Once a day,1-2 hours per week,Dogs,Windows 7,Yes,2 - 100,Multiple times a day,I want to be a better developer,Masters Degree in Computer Science (or related...,"I'm not actively looking, but I am open to new...",Salary; Opportunity for advancement; Building ...,Writing my CV and keeping it updated,10%,Show me more live code; Introduce me to the te...,Star Wars,Agree somewhat,Agree completely,Agree somewhat,Agree somewhat,Agree somewhat,Disagree somewhat,Neutral,Disagree somewhat,Disagree somewhat,Disagree completely,This is somewhat important,This is very important,I don't care about this,This is somewhat important,This is very important,This is somewhat important,I don't care about this,This is somewhat important,I don't care about this,I don't care about this,Poor scheduling; Unrealistic expectations; Cha...,To get help for my job; Beacause I love to learn
42323,Meta Stack Overflow Post,United States,North America,North America,35-39,37.0,Male,Sr. Developer,Back-end web developer,Back-end web developer,11+ years,13.0,"$120,000 - $130,000",125000.0,4.93,C#; SQL; SQL Server,C#; F#; Rust; SQL; SQL Server,No,9.0,Employed full-time,Media / Advertising,20-99 employees,5-9 people,0,Full-time remote,I love my job,A friend referred me,Sublime; Visual Studio,Multiple times a day,,Dogs,Windows 10,Yes,"1,001 - 5,000",Once a day,To keep my skills up to date,Masters Degree in Computer Science (or related...,I am not interested in new job opportunities,Salary; Company culture; Remote working option,The interview process,100%,Fewer brainteasers; Prepare me for who I will ...,Star Wars; Star Trek,Agree somewhat,Agree completely,Agree completely,Neutral,Agree completely,Disagree somewhat,Agree somewhat,Disagree somewhat,Neutral,Disagree completely,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is very important,This is somewhat important,This is very important,This is very important,This is very important,Poor team dynamics; Unrealistic expectations; ...,Because I can't do my job without it


Now we have a better sense of what the data looks like, so we can proceed to picking the columns that we will need for the analysis, given the questions we outlined in [Section 1](#business).

The columns that indicate which languages respondents have worked with are the following:
> Survey 2020: LanguageWorkedWith \
> Survey 2019: LanguageWorkedWith \
> Survey 2018: LanguageWorkedWith \
> Survey 2017: HaveWorkedLanguage \
> Survey 2016: tech_do

The columns that indicate how satisfied a respondent is with their job are:
> Survey 2020: JobSat \
> Survey 2019: JobSat \
> Survey 2018: JobSatisfaction \
> Survey 2017: JobSatisfaction \
> Survey 2016: job_satisfaction

The columns that indicate education status are the following:
> Survey 2020: EdLevel \
> Survey 2019: EdLevel \
> Survey 2018: FormalEducation \
> Survey 2017: FormalEducation \
> Survey 2016: education

The columns that indicate where the respondent lives are:
> Survey 2020: Country \
> Survey 2019: Country \
> Survey 2018: Country \
> Survey 2017: Country \
> Survey 2016: country

The columns that indicate the respondent's gender are:
> Survey 2020: Gender \
> Survey 2019: Gender \
> Survey 2018: Gender \
> Survey 2017: Gender \
> Survey 2016: gender

The columns that indicate the respondent's employment status are:
> Survey 2020: Employment \
> Survey 2019: Employment \
> Survey 2018: Employment \
> Survey 2017: EmploymentStatus \
> Survey 2016: employment_status
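
The lists above can be captured in a single dictionary and checked against each dataframe before any column is selected. This is a minimal sketch: the source column names are copied from the lists above, the harmonized names on the right are the ones used later in this notebook, and `COLUMN_MAP` and `missing_columns` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Source column per survey year (from the lists above) mapped to a common name
COLUMN_MAP = {
    2020: {'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction',
           'EdLevel': 'education', 'Country': 'country',
           'Gender': 'gender', 'Employment': 'employment'},
    2019: {'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction',
           'EdLevel': 'education', 'Country': 'country',
           'Gender': 'gender', 'Employment': 'employment'},
    2018: {'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction',
           'FormalEducation': 'education', 'Country': 'country',
           'Gender': 'gender', 'Employment': 'employment'},
    2017: {'HaveWorkedLanguage': 'languages', 'JobSatisfaction': 'job_satisfaction',
           'FormalEducation': 'education', 'Country': 'country',
           'Gender': 'gender', 'EmploymentStatus': 'employment'},
    2016: {'tech_do': 'languages', 'job_satisfaction': 'job_satisfaction',
           'education': 'education', 'country': 'country',
           'gender': 'gender', 'employment_status': 'employment'},
}


def missing_columns(df, year):
    """Return the expected source columns that are absent from df."""
    return sorted(set(COLUMN_MAP[year]) - set(df.columns))


# Toy frame that lacks one of the expected 2017 columns
toy_2017 = pd.DataFrame(columns=['HaveWorkedLanguage', 'JobSatisfaction',
                                 'FormalEducation', 'Country', 'Gender'])
print(missing_columns(toy_2017, 2017))
```

An empty list for every year would confirm that each survey really contains the columns we intend to keep.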


Lastly, we can take a look at the shapefiles we imported:

In [8]:
# Head of shapefiles
map_df.head()

Unnamed: 0,OBJECTID,CNTRY_NAME,CNTRY_CODE,BPL_CODE,geometry
0,1,Algeria,12,13010.0,"MULTIPOLYGON (((-2.05592 35.07370, -2.05675 35..."
1,2,Angola,24,12010.0,"MULTIPOLYGON (((12.79760 -4.41685, 12.79875 -4..."
2,3,In dispute South Sudan/Sudan,9999,99999.0,"POLYGON ((28.08408 9.34722, 28.03889 9.34722, ..."
3,4,Benin,204,15010.0,"MULTIPOLYGON (((1.93753 6.30122, 1.93422 6.299..."
4,5,Botswana,72,14010.0,"POLYGON ((25.16312 -17.77816, 25.16383 -17.778..."


So the countries are in the variable CNTRY_NAME. We will need to match these with our data, and for that we will have to harmonize all the country names in [Section 3](#prepare).
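
To see which country names would fail to match, one option is a left merge with `indicator=True`. The sketch below uses tiny made-up frames in place of `map_df` and the survey data, so the country values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the shapefile attribute table and the survey countries
shape_names = pd.DataFrame({'CNTRY_NAME': ['Algeria', 'Angola', 'Benin']})
survey_countries = pd.DataFrame({'country': ['Algeria', 'Benin', 'United States', 'Benin']})

# Left merge on the distinct survey names; the _merge column records
# whether each name was found in the shapefile
check = survey_countries.drop_duplicates().merge(
    shape_names,
    how='left',
    left_on='country',
    right_on='CNTRY_NAME',
    indicator=True,
)

# Names present in the survey but absent from the shapefile
unmatched = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

Any name that shows up in `unmatched` needs to be harmonized before the survey results can be joined to the map.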

Given all of the above we can proceed to prepare our data!

<a name="prepare"></a>
## 3. Prepare data

Since we already know which columns we will need in order to answer our questions, we can start by dropping the columns that are not relevant to us:

In [9]:
# Put relevant variables in list
keep_2020 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2019 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2018 = ['LanguageWorkedWith', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'Employment']
keep_2017 = ['HaveWorkedLanguage', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'EmploymentStatus']
keep_2016 = ['tech_do', 'job_satisfaction', 'education', 'country', 'gender', 'employment_status']

# Keep only relevant variables
survey_2020 = survey_2020[keep_2020]
survey_2019 = survey_2019[keep_2019]
survey_2018 = survey_2018[keep_2018]
survey_2017 = survey_2017[keep_2017]
survey_2016 = survey_2016[keep_2016]

Nice! Now we can rename the columns so that all dataframes use the same variable names.

In [10]:
# Rename columns
survey_2020.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2019.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2018.rename(columns={'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2017.rename(columns={'HaveWorkedLanguage': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'EmploymentStatus': 'employment'}, inplace = True)
survey_2016.rename(columns={'tech_do': 'languages', 'job_satisfaction': 'job_satisfaction', 
                           'education': 'education', 'country': 'country', 'gender': 'gender', 
                           'employment_status': 'employment'}, inplace = True);
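
As an alternative to `inplace = True`, `rename` can return a new dataframe, which avoids modifying a slice of the original data. The sketch below shows that pattern on two toy frames and then stacks them with a `year` column. The frames, the `rename_maps` dictionary and the `year` column are assumptions made for illustration, not the approach this notebook takes.

```python
import pandas as pd

# Toy stand-ins for two raw surveys (hypothetical rows)
raw = {
    2017: pd.DataFrame({'HaveWorkedLanguage': ['C#; SQL'],
                        'EmploymentStatus': ['Employed full-time'],
                        'Salary': [1.0]}),
    2020: pd.DataFrame({'LanguageWorkedWith': ['Go'],
                        'Employment': ['Employed full-time'],
                        'Age': [27.0]}),
}

# Source column -> common name, per year
rename_maps = {
    2017: {'HaveWorkedLanguage': 'languages', 'EmploymentStatus': 'employment'},
    2020: {'LanguageWorkedWith': 'languages', 'Employment': 'employment'},
}

# Select the relevant columns, rename them (returns a new frame),
# and tag every row with its survey year
harmonized = [
    df[list(rename_maps[year])].rename(columns=rename_maps[year]).assign(year=year)
    for year, df in raw.items()
]

# Stack all years into one dataframe
combined = pd.concat(harmonized, ignore_index=True)
print(combined)
```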

Now we need to harmonize the answers to the different questions across all survey years in order to merge them into one complete data set. Let's start with an easy one and look at the categories for gender in each year.

In [11]:
# Print unique gender categories in 2020
survey_2020['gender'].unique()

array(['Man', nan, 'Woman',
       'Man;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man'], dtype=object)

In [12]:
# Print unique gender categories in 2019
survey_2019['gender'].unique()

array(['Man', nan, 'Woman',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man',
       'Man;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [13]:
# Print unique gender categories in 2018
survey_2018['gender'].unique()

array(['Male', nan, 'Female',
       'Female;Male;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Male',
       'Male;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming', 'Transgender',
       'Female;Transgender',
       'Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Non-binary, genderqueer, or gender non-conforming',
       'Female;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender', 'Female;Male;Transgender',
       'Female;Male;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [14]:
# Print unique gender categories in 2017
survey_2017['gender'].unique()

array(['Male', nan, 'Female', 'Gender non-conforming', 'Other',
       'Male; Gender non-conforming', 'Female; Transgender',
       'Male; Female', 'Male; Other', 'Transgender',
       'Transgender; Gender non-conforming',
       'Female; Gender non-conforming',
       'Male; Female; Transgender; Gender non-conforming; Other',
       'Male; Female; Transgender', 'Male; Female; Other',
       'Male; Female; Transgender; Gender non-conforming',
       'Male; Transgender', 'Female; Transgender; Gender non-conforming',
       'Gender non-conforming; Other',
       'Male; Female; Gender non-conforming', 'Female; Other',
       'Male; Transgender; Gender non-conforming', 'Transgender; Other',
       'Male; Gender non-conforming; Other',
       'Female; Gender non-conforming; Other',
       'Male; Female; Gender non-conforming; Other',
       'Female; Transgender; Other',
       'Female; Transgender; Gender non-conforming; Other',
       'Male; Transgender; Other', 'Male; Female; Transgender;

In [15]:
# Print unique gender categories in 2016
survey_2016['gender'].unique()

array(['Male', nan, 'Female', 'Prefer not to disclose', 'Other'],
      dtype=object)

Given what we see above, let's group all answers into the following four categories: Male, Female, Other and nan. We can define a function to assign one of these values to each response.

In [16]:
# Define function to harmonize gender
def harmonize_gender(df_raw):
    '''This function unifies all gender categories into 
    four: Male, Female, Other and nan
    '''
    # Copy df_raw
    df = df_raw.copy()
    # Loop over rows
    for i in tqdm(df.index):
        # Define gender
        gender = str(df.loc[i, 'gender']).lower()
        # Value if male or man
        if gender == 'male' or gender == 'man':
            df.loc[i, 'gender'] = 'Male'
        # Value if female or woman
        elif gender == 'female' or gender == 'woman':
            df.loc[i, 'gender'] = 'Female'
        # Assign null values
        elif gender == 'nan':
            df.loc[i, 'gender'] = np.nan
        # Other categories lumped into other
        else:
            df.loc[i, 'gender'] = 'Other'
    # Return harmonized dataframe
    return(df)

# Apply gender harmonizer
survey_2020 = harmonize_gender(survey_2020)
survey_2019 = harmonize_gender(survey_2019)
survey_2018 = harmonize_gender(survey_2018)
survey_2017 = harmonize_gender(survey_2017)
survey_2016 = harmonize_gender(survey_2016)
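
The row-by-row loop works, but it is slow on tens of thousands of rows. As a minimal sketch of the same grouping done column-wise, assuming the `gender` column holds strings and missing values only: the name `harmonize_gender_vectorized` and the demo frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def harmonize_gender_vectorized(df_raw):
    '''Column-wise variant: Male, Female, Other and nan.'''
    # Copy df_raw
    df = df_raw.copy()
    # Lower-case the answers; missing values stay missing
    gender = df['gender'].str.lower()
    # Single-answer categories that map to Male or Female
    mapping = {'male': 'Male', 'man': 'Male',
               'female': 'Female', 'woman': 'Female'}
    # Anything not in the mapping becomes Other
    harmonized = gender.map(mapping).fillna('Other')
    # Restore the original missing values
    df['gender'] = harmonized.where(gender.notna(), np.nan)
    return df

# Small made-up frame to show the behaviour
demo = pd.DataFrame({'gender': ['Man', 'Female', np.nan, 'Woman;Man', 'male']})
demo_out = harmonize_gender_vectorized(demo)
print(demo_out['gender'].tolist())
```

Multi-select answers such as `'Woman;Man'` end up in Other here as well, which matches the loop-based function.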

Similarly, for job satisfaction we can look at the possible values:

In [17]:
# Print unique job satisfaction categories in 2020
survey_2020['job_satisfaction'].unique()

array(['Slightly satisfied', 'Very dissatisfied', nan,
       'Slightly dissatisfied', 'Very satisfied',
       'Neither satisfied nor dissatisfied'], dtype=object)

In [18]:
# Print unique job satisfaction categories in 2019
survey_2019['job_satisfaction'].unique()

array([nan, 'Slightly satisfied', 'Slightly dissatisfied',
       'Neither satisfied nor dissatisfied', 'Very satisfied',
       'Very dissatisfied'], dtype=object)

In [19]:
# Print unique job satisfaction categories in 2018
survey_2018['job_satisfaction'].unique()

array(['Extremely satisfied', 'Moderately dissatisfied',
       'Moderately satisfied', 'Neither satisfied nor dissatisfied',
       'Slightly satisfied', nan, 'Slightly dissatisfied',
       'Extremely dissatisfied'], dtype=object)

In [20]:
# Print unique job satisfaction categories in 2017
survey_2017['job_satisfaction'].unique()

array([nan,  9.,  3.,  8.,  6.,  7.,  5.,  4., 10.,  2.,  0.,  1.])

In [21]:
# Print unique job satisfaction categories in 2016
survey_2016['job_satisfaction'].unique()

array([nan, 'I love my job', "I don't have a job",
       "I'm somewhat satisfied with my job",
       "I'm somewhat dissatisfied with my job",
       "I'm neither satisfied nor dissatisfied", 'Other (please specify)',
       'I hate my job'], dtype=object)

We will lump all answers into six categories: Very satisfied, Satisfied, Neither, Dissatisfied, Very Dissatisfied and nan. We will take a similar approach to gender and define a function to do this.

In [22]:
# Define function to harmonize job satisfaction
def harmonize_jobsatisfaction(df_raw):
    '''This function harmonizes all the job
    satisfaction responses into: Very satisfied,
    Satisfied, Neither, Dissatisfied, Very dissatisfied
    and nan'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    v_satisfied = ['very satisfied', 'extremely satisfied', 'i love my job', '10', '9']
    satisfied = ['slightly satisfied', 'moderately satisfied', 'i\'m somewhat satisfied with my job', '8', '7']
    neither = ['neither satisfied nor dissatisfied', 'i\'m neither satisfied nor dissatisfied', '6', '5', '4']
    dissatisfied = ['slightly dissatisfied', 'moderately dissatisfied', 'i\'m somewhat dissatisfied with my job', '3', '2']
    v_dissatisfied = ['very dissatisfied', 'extremely dissatisfied', 'i hate my job', '1', '0']
    # Loop over rows
    for i in tqdm(df.index):
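        # Define job satisfaction
        job_satisfac = str(df.loc[i, 'job_satisfaction']).lower()
        # 2017 scores are floats, so turn e.g. '9.0' into '9'
        if job_satisfac.endswith('.0'):
            job_satisfac = job_satisfac[:-2]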
        # Value if very satisfied
        if job_satisfac in v_satisfied:
            df.loc[i, 'job_satisfaction'] = 'Very satisfied'
        # Value if satisfied
        elif job_satisfac in satisfied:
            df.loc[i, 'job_satisfaction'] = 'Satisfied'
        # Value if neither
        elif job_satisfac in neither:
            df.loc[i, 'job_satisfaction'] = 'Neither'
        # Value if dissatisfied
        elif job_satisfac in dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Dissatisfied'
        # Value if very dissatisfied
        elif job_satisfac in v_dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Very Dissatisfied'
        # Other categories become np.nan values
        else:
            df.loc[i, 'job_satisfaction'] = np.nan
    # Return harmonized dataframe
    return(df)
    
    
# Apply job satisfaction harmonizer
survey_2020 = harmonize_jobsatisfaction(survey_2020)
survey_2019 = harmonize_jobsatisfaction(survey_2019)
survey_2018 = harmonize_jobsatisfaction(survey_2018)
survey_2017 = harmonize_jobsatisfaction(survey_2017)
survey_2016 = harmonize_jobsatisfaction(survey_2016)
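
The 2017 survey records job satisfaction as a score from 0 to 10 rather than as text. As a hedged illustration of how such scores fall into the five labels, the sketch below bins them with `pd.cut` using the same cut-offs as the lists in the function (0-1, 2-3, 4-6, 7-8, 9-10). The `scores` series is made-up demo data.

```python
import numpy as np
import pandas as pd

# Made-up scores on the 0-10 scale, including a missing value
scores = pd.Series([np.nan, 9., 3., 0., 6., 7., 10.])

# Labels ordered from the lowest to the highest bin
labels = ['Very Dissatisfied', 'Dissatisfied', 'Neither',
          'Satisfied', 'Very satisfied']

# Right-closed bins: (-1, 1], (1, 3], (3, 6], (6, 8], (8, 10]
binned = pd.cut(scores, bins=[-1, 1, 3, 6, 8, 10], labels=labels).astype(object)
print(binned.tolist())
```

Missing scores stay missing, so they line up with the nan category used for the other years.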

Next, let's look at the employment variables and how they are laid out:

In [42]:
# Print unique employment categories in 2020
survey_2020['employment'].unique()

Independent contractor, freelancer, or self-employed
Employed full-time
nan
Student
Not employed, but looking for work
Employed part-time
Retired
Not employed, and not looking for work


In [44]:
# Print unique employment categories in 2019
survey_2019['employment'].unique()

Not employed, and not looking for work
Not employed, but looking for work
Employed full-time
Independent contractor, freelancer, or self-employed
nan
Employed part-time
Retired


In [45]:
# Print unique employment categories in 2018
survey_2018['employment'].unique()

Employed part-time
Employed full-time
Independent contractor, freelancer, or self-employed
Not employed, and not looking for work
Not employed, but looking for work
nan
Retired


In [46]:
# Print unique employment categories in 2017
survey_2017['employment'].unique()

Not employed, and not looking for work
Employed part-time
Employed full-time
Independent contractor, freelancer, or self-employed
Not employed, but looking for work
I prefer not to say
Retired


In [47]:
# Print unique employment categories in 2016
survey_2016['employment'].unique()

nan
Employed full-time
Freelance / Contractor
Self-employed
I'm a student
Unemployed
Prefer not to disclose
Employed part-time
Other (please specify)
Retired


The responses for employment status are a bit trickier, as the answer categories have changed over the years. With that in mind, let's create a function to harmonize them into the following categories: Full-time, Part-time, Self-employed, Not employed, Other and nan.

In [49]:
# Define function to harmonize employment categories
def harmonize_employment(df_raw):
    '''This function harmonizes all employment responses
    into: Full-time, Part-time, Self-employed, Not employed,
    Other and nan'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    full_time = ['employed full-time']
    part_time = ['employed part-time']
    self_empl = ['independent contractor, freelancer, or self-employed', 'freelance / contractor', 'self-employed']
    not_employed = ['not employed, but looking for work', 'not employed, and not looking for work', 'unemployed']
    other = ['student', 'i\'m a student', 'retired', 'i prefer not to say', 'prefer not to disclose', 'other (please specify)']
    # Loop over rows
    for i in tqdm(df.index):
        # Define employment
        employment = str(df.loc[i, 'employment']).lower()
        # Value if full-time
        if employment in full_time:
            df.loc[i, 'employment'] = 'Full-time'
        # Value if part-time
        elif employment in part_time:
            df.loc[i, 'employment'] = 'Part-time'
        # Value if self-employed
        elif employment in self_empl:
            df.loc[i, 'employment'] = 'Self-employed'
        # Value if not employed
        elif employment in not_employed:
            df.loc[i, 'employment'] = 'Not employed'
        # Value if other
        elif employment in other:
            df.loc[i, 'employment'] = 'Other'
        # Other categories become np.nan values
        else:
            df.loc[i, 'employment'] = np.nan
    # Return harmonized dataframe
    return(df)

# Apply employment harmonizer
survey_2020 = harmonize_employment(survey_2020)
survey_2019 = harmonize_employment(survey_2019)
survey_2018 = harmonize_employment(survey_2018)
survey_2017 = harmonize_employment(survey_2017)
survey_2016 = harmonize_employment(survey_2016)
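
Because the final `else` branch turns every unlisted answer into nan, a typo or a new answer category would disappear silently. A minimal sketch of a check that could be run on the raw column before harmonizing; `unmapped_values` is a hypothetical helper, and the lists and series below are shortened demo data.

```python
import pandas as pd

def unmapped_values(series, known_values):
    '''Return the raw answers that none of the known values would match.'''
    # Compare in lower case, ignoring missing values
    lowered = series.dropna().astype(str).str.lower()
    # Whatever is left over would be turned into nan by the harmonizer
    return sorted(set(lowered) - set(known_values))

# Shortened demo: 'student' is deliberately left out of the known values
known = ['employed full-time', 'employed part-time', 'retired']
raw = pd.Series(['Employed full-time', 'Retired', 'Student', None, 'Student'])
leftover = unmapped_values(raw, known)
print(leftover)
```

An empty result would mean that every non-missing answer is covered by one of the lists.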

Now let's look at education variables and their respective values.

In [50]:
# Print unique education categories in 2020
survey_2020['education'].unique()

array(['Master’s degree (M.A., M.S., M.Eng., MBA, etc.)',
       'Bachelor’s degree (B.A., B.S., B.Eng., etc.)', nan,
       'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
       'Professional degree (JD, MD, etc.)',
       'Some college/university study without earning a degree',
       'Associate degree (A.A., A.S., etc.)',
       'Other doctoral degree (Ph.D., Ed.D., etc.)',
       'Primary/elementary school',
       'I never completed any formal education'], dtype=object)

In [51]:
# Print unique education categories in 2019
survey_2019['education'].unique()

array(['Primary/elementary school',
       'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
       'Bachelor’s degree (BA, BS, B.Eng., etc.)',
       'Some college/university study without earning a degree',
       'Master’s degree (MA, MS, M.Eng., MBA, etc.)',
       'Other doctoral degree (Ph.D, Ed.D., etc.)', nan,
       'Associate degree', 'Professional degree (JD, MD, etc.)',
       'I never completed any formal education'], dtype=object)

In [52]:
# Print unique education categories in 2018
survey_2018['education'].unique()

array(['Bachelor’s degree (BA, BS, B.Eng., etc.)', 'Associate degree',
       'Some college/university study without earning a degree',
       'Master’s degree (MA, MS, M.Eng., MBA, etc.)',
       'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
       nan, 'Primary/elementary school',
       'Professional degree (JD, MD, etc.)',
       'I never completed any formal education',
       'Other doctoral degree (Ph.D, Ed.D., etc.)'], dtype=object)

In [53]:
# Print unique education categories in 2017
survey_2017['education'].unique()

array(['Secondary school',
       "Some college/university study without earning a bachelor's degree",
       "Bachelor's degree", 'Doctoral degree', "Master's degree",
       'Professional degree', 'Primary/elementary school',
       'I prefer not to answer', 'I never completed any formal education'],
      dtype=object)

In [65]:
# Print unique education categories in 2016
survey_2016['education'].unique()

array([nan,
       "I'm self-taught; On-the-job training; B.S. in Computer Science (or related field)",
       "I'm self-taught; On-the-job training", "I'm self-taught",
       'B.S. in Computer Science (or related field)',
       "I'm self-taught; On-the-job training; Online class (e.g. Coursera, Codecademy, Khan Academy, etc.); B.S. in Computer Science (or related field)",
       "I'm self-taught; Online class (e.g. Coursera, Codecademy, Khan Academy, etc.); B.A. in Computer Science (or related field); B.S. in Computer Science (or related field)",
       "I'm self-taught; On-the-job training; Masters Degree in Computer Science (or related field)",
       "I'm self-taught; Online class (e.g. Coursera, Codecademy, Khan Academy, etc.)",
       'Masters Degree in Computer Science (or related field)',
       "I'm self-taught; B.A. in Computer Science (or related field)",
       'B.A. in Computer Science (or related field)',
       "I'm self-taught; On-the-job training; Some college course

The answers for 2016 look very different from those of the other years. This is probably because respondents were allowed to tick more than one box.
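
To see which individual boxes were ticked, the combined answers can be split on the `'; '` separator visible in the output. The following is a rough sketch on a few made-up answers in the 2016 format, not on the real column.

```python
import pandas as pd

# A few answers in the 2016 format, with '; ' between ticked boxes
answers = pd.Series([
    "I'm self-taught; On-the-job training",
    "I'm self-taught",
    'B.S. in Computer Science (or related field)',
    None,
])

# One row per ticked box, then count how often each box appears
options = answers.dropna().str.split('; ').explode()
counts = options.value_counts()
print(counts)
```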

Lastly, to match the survey data with the geodata we imported, we need to harmonize the country names. To do that, we can use the [country_converter library](https://pypi.org/project/country-converter/). We define the following function and apply it to the country values:

In [None]:
# Define function to convert country name into ISO3
def country_iso3(df_raw, df_type = 'survey'):
    '''This function creates a column with ISO3
    country values'''
    # Check if df_type is valid
    if df_type not in ('survey', 'map'):
        raise ValueError("df_type must be 'survey' or 'map'")
    # Copy df_raw
    df = df_raw.copy()
    # Pick the column that holds the country name
    column = 'country' if df_type == 'survey' else 'CNTRY_NAME'
    # Loop over row values
    for i in tqdm(df.index):
        # Define country value
        country = str(df.loc[i, column])
        # Convert to ISO3
        df.loc[i, 'iso3'] = coco.convert(names = country, to = 'ISO3')
    # Return dataframe
    return(df)

# Convert ISO3 
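
Converting every row separately repeats the same lookup many times, since a survey has far fewer distinct countries than respondents. Below is a minimal sketch that converts each distinct name once and maps the result back. `add_iso3` and the stand-in `demo_convert` are hypothetical names; in the notebook the converter would be a call such as `coco.convert(names = name, to = 'ISO3')`.

```python
import pandas as pd

def add_iso3(df_raw, column, convert):
    '''Add an iso3 column, calling convert once per distinct name.'''
    # Copy df_raw
    df = df_raw.copy()
    # Distinct, non-missing country names
    names = df[column].dropna().unique()
    # Build the lookup table once
    lookup = {name: convert(name) for name in names}
    # Map every row through the lookup; missing names stay missing
    df['iso3'] = df[column].map(lookup)
    return df

# Stand-in converter for the demo (assumption: takes the place of coco.convert)
demo_codes = {'Germany': 'DEU', 'United States': 'USA'}

def demo_convert(name):
    return demo_codes.get(name, 'not found')

demo_countries = pd.DataFrame({'country': ['Germany', 'United States', 'Germany', None]})
demo_iso3 = add_iso3(demo_countries, 'country', demo_convert)
print(demo_iso3)
```

Unlike the loop above, this sketch leaves missing country values missing instead of passing the string `'nan'` to the converter.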
        


With that in hand, we can add a variable to each dataset to mark the year it represents and then merge them:

In [None]:
# Add year variable to dataframes
survey_2020['year'] = 2020
survey_2019['year'] = 2019
survey_2018['year'] = 2018
survey_2017['year'] = 2017
survey_2016['year'] = 2016

# Merge datasets into one
data = [survey_2020, survey_2019, survey_2018, survey_2017, survey_2016]
survey = pd.concat(data)
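
One thing to keep in mind: `pd.concat` keeps the row labels of each input, so the same index values repeat once per year unless `ignore_index=True` is passed. A small sketch with two made-up frames:

```python
import pandas as pd

# Two tiny stand-ins for yearly surveys
frame_a = pd.DataFrame({'year': [2020, 2020]})
frame_b = pd.DataFrame({'year': [2019, 2019]})

# Default behaviour: the original row labels are kept and therefore repeat
stacked = pd.concat([frame_a, frame_b])
print(stacked.index.tolist())

# With ignore_index=True the rows get a fresh running index
renumbered = pd.concat([frame_a, frame_b], ignore_index=True)
print(renumbered.index.tolist())
```

Whether a fresh index is needed depends on how the merged data is accessed later; label-based lookups with `.loc` are where repeated labels matter.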

Another thing we need to prepare in order to later match the surveys with the shapefile data is the country names. To get exact matches between the country names in the two sources, we can convert all the names to ISO3 codes using regular expressions and then match on those. Luckily, the country-converter package does that for us.

In [None]:
# List of survey files
dfs = [survey_2020, survey_2019, survey_2018, survey_2017, survey_2016]
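
Once both sides carry ISO3 codes, an outer merge with `indicator=True` shows which codes appear in only one of them. This is a sketch on made-up stand-in tables, assuming both already have an `iso3` column; `survey_demo` and `map_demo` are hypothetical names.

```python
import pandas as pd

# Stand-in tables (assumption: both already carry an iso3 column)
survey_demo = pd.DataFrame({'iso3': ['DEU', 'USA', 'BHS'],
                            'respondents': [10, 20, 1]})
map_demo = pd.DataFrame({'iso3': ['DEU', 'USA', 'FRA'],
                         'CNTRY_NAME': ['Germany', 'United States', 'France']})

# The _merge column records whether a code came from one table or both
check = survey_demo.merge(map_demo, on='iso3', how='outer', indicator=True)

# Codes present in only one of the two tables
survey_only = check.loc[check['_merge'] == 'left_only', 'iso3'].tolist()
map_only = check.loc[check['_merge'] == 'right_only', 'iso3'].tolist()
print(survey_only, map_only)
```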



<a name="model"></a>
## 4. Data modeling

Text text

<a name="eval"></a>
## 5. Evaluate the results

Text text

<a name="deploy"></a>
## 6. Deploy

Text text

In [None]:
import os
os.getcwd()

In [None]:
import geopandas as gpd

In [None]:
map_df = gpd.read_file('IPUMSI_world_release2017/world_countries_2017.shp')

In [None]:
ax = map_df.plot()
ax.axis('off');

In [None]:
map_df.head()

In [None]:
countries = map_df['CNTRY_NAME'].unique().tolist()

In [None]:
'Bahamas' in countries

In [None]:
countries2 = survey_2020['Country'].unique().tolist()

In [None]:
import country_converter as coco

iso_lst1 = []
iso_lst2 = []

for country in countries:
    iso1 = coco.convert(names=country, to='ISO3')
    iso_lst1.append(iso1)

for country2 in countries2:
    iso2 = coco.convert(names=country2, to='ISO3')
    iso_lst2.append(iso2)


In [None]:
for i in iso_lst2:
    print(i, i in iso_lst1)

In [None]:
'US' in iso_lst1

In [None]:
coco.convert(names='United States of America', to='ISO3')

In [None]:
'USA' in iso_lst2

In [None]:
for i in survey_2020.index:
    survey_2020.loc[i, 'Country'] = coco.convert(names = str(survey_2020.loc[i, 'Country']), to = 'ISO3')

In [None]:
survey_2020.head()