# Title here

Description here

## Table of contents
- [1. Business undestanding](#business)
- [2. Data understanding](#data)
    - [2.1. Gathering data](#gather)
    - [2.2. Assessing data](#assess)
- [3. Prepare data](#prepare)
- [4. Data modeling](#model)
- [5. Evaluate the results](#eval)
- [6. Deploy](#deploy)

<a name="business"></a>
## 1. Business understanding

Text text

> Question 1 \
> Question 2 \
> Question 3 \
> Question 4

<a name="data"></a>
## 2. Data understanding

Text text

<a name="gather"></a>
   

<a name="gather"></a>
### 2.1. Gathering data

First, we need to download all the necessary data. In order to do so, we can run the line below to download all Stack Overflow surveys for all years:

In [None]:
# Download survey data
%run -i '../download/download.py'

# Download shape files
%run -i '../download/shape.py'

These are all the surveys since 2011. We will only use the ones from the last five years. One of the reasons for doing so is that the structure of the survey changed and similar questions might not be comparable anymore. Next, in preparation for the next sections we can import the relevant libraries.

In [1]:
# Import libraries
import country_converter as coco
import geopandas as gpd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
%matplotlib inline

<a name="assess"></a>
### 2.2. Assessing data

Now that we have downloaded all the datasets let's start by reading the csvs from the past five years. In doing so, I am ignoring the first column as it serves as an ordered identifier for the respondants.

In [2]:
# Import survey data and skip first column
import warnings; warnings.simplefilter('ignore')
survey_2016 = pd.read_csv("../data/survey/survey_2016.csv").iloc[:, 1:]
survey_2017 = pd.read_csv("../data/survey/survey_2017.csv").iloc[:, 1:]
survey_2018 = pd.read_csv("../data/survey/survey_2018.csv").iloc[:, 1:]
survey_2019 = pd.read_csv("../data/survey/survey_2019.csv").iloc[:, 1:]
survey_2020 = pd.read_csv("../data/survey/survey_2020.csv").iloc[:, 1:]

# Import shapefile with geopandas
map_df = gpd.read_file("../data/shapefile/world_countries_2017.shp")

Great! Now we can quickly look at what these datasets look like. I will do that by picking two random samples from the survey.

In [3]:
# Show dataframe for two random samples for 2020
pd.options.display.max_columns = None # to show all columns
survey_2020.sample(2)

Unnamed: 0,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,CurrencySymbol,DatabaseDesireNextYear,DatabaseWorkedWith,DevType,EdLevel,Employment,Ethnicity,Gender,JobFactors,JobSat,JobSeek,LanguageDesireNextYear,LanguageWorkedWith,MiscTechDesireNextYear,MiscTechWorkedWith,NEWCollabToolsDesireNextYear,NEWCollabToolsWorkedWith,NEWDevOps,NEWDevOpsImpt,NEWEdImpt,NEWJobHunt,NEWJobHuntResearch,NEWLearn,NEWOffTopic,NEWOnboardGood,NEWOtherComms,NEWOvertime,NEWPurchaseResearch,NEWPurpleLink,NEWSOSites,NEWStuck,OpSys,OrgSize,PlatformDesireNextYear,PlatformWorkedWith,PurchaseWhat,Sexuality,SOAccount,SOComm,SOPartFreq,SOVisitFreq,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
14616,"I am not primarily a developer, but I write co...",Yes,,15,Monthly,,,Russian Federation,Russian ruble,RUB,MySQL;SQLite,MySQL;SQLite,Data scientist or machine learning specialist;...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,White or of European descent,Woman,"Languages, frameworks, and other technologies ...",Neither satisfied nor dissatisfied,I am actively looking for a job,Python;R;SQL,HTML/CSS;PHP;Python;SQL,Pandas,Pandas,Confluence;Jira;Github;Gitlab;Slack;Stack Over...,Confluence;Jira;Github;Gitlab;Stack Overflow f...,No,Somewhat important,Fairly important,Wanting to share accomplishments with a wider ...,"Read company media, such as employee blogs or ...",Once every few years,Not sure,Onboarding? What onboarding?,No,Occasionally: 1-2 days per quarter but less th...,,Amused,Stack Overflow (public Q&A for anyone who code...,Call a coworker or friend;Visit Stack Overflow...,Windows,"1,000 to 4,999 employees",Google Cloud Platform;iOS;MacOS;Raspberry Pi;W...,Raspberry Pi;Windows,I have little or no influence,Straight / Heterosexual,Yes,Neutral,I have never participated in Q&A on Stack Over...,Less than once per month or monthly,Neither easy nor difficult,Appropriate in length,No,"A business discipline (such as accounting, fin...",,,Just as welcome now as I felt last year,40.0,4,1
14760,I am a developer by profession,No,23.0,15,Monthly,2086.0,27060.0,Belgium,European Euro,EUR,,,"Developer, game or graphics","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,White or of European descent,Man,"Languages, frameworks, and other technologies ...",Very dissatisfied,"I’m not actively looking, but I am open to new...",C#;C++;Python,C#;C++;Python,.NET;Unity 3D;Unreal Engine,.NET;Unity 3D;Unreal Engine,,,No,Extremely important,Fairly important,Curious about other opportunities;Better compe...,"Read company media, such as employee blogs or ...",Once every few years,Not sure,Yes,No,Rarely: 1-2 days per year or less,Start a free trial;Ask developers I know/work ...,"Hello, old friend",Stack Overflow (public Q&A for anyone who code...,Call a coworker or friend;Visit Stack Overflow...,Windows,10 to 19 employees,Windows,Android;iOS;Windows,I have a great deal of influence,Straight / Heterosexual,No,"No, not at all",,Daily or almost daily,Easy,Appropriate in length,No,"Computer science, computer engineering, or sof...",,,Just as welcome now as I felt last year,38.0,7,2


And for the remaining years we see:

In [4]:
# Random sample for 2019
survey_2019.sample(2)

Unnamed: 0,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
30244,I am a developer by profession,Yes,Never,,Employed full-time,United Kingdom,No,Some college/university study without earning ...,"A humanities discipline (ex. literature, histo...",Taken an online course in programming or softw...,"10,000 or more employees","Developer, back-end",Less than 1 year,24,Less than 1 year,Very satisfied,Slightly satisfied,Somewhat confident,Yes,Yes,"I’m not actively looking, but I am open to new...",Less than a year ago,,No,"Languages, frameworks, and other technologies ...","My job status changed (promotion, new job, etc.)",GBP,Pound sterling,,,,,There is a schedule and/or spec (made by me or...,Inadequate access to necessary tools;Non-work ...,"Less than half the time, but at least one day ...",Office,Average,"Yes, because I see value in code review",,"No, but I think we should",Not sure,I have little or no influence,Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;R...,C#;HTML/CSS;Python;Ruby;SQL,PostgreSQL,,AWS;MacOS;Slack;Windows,AWS;Heroku;MacOS;Raspberry Pi;Slack;Windows,ASP.NET;jQuery,ASP.NET;Ruby on Rails,.NET;.NET Core;Unity 3D,.NET;.NET Core;Unity 3D;Unreal Engine,Atom;Visual Studio;Visual Studio Code,Windows,I do not use containers,Not at all,,Yes,Yes,Yes,Instagram,Online,Username,2018,Daily or almost daily,Find answers to specific questions,3-5 times per week,Stack Overflow was much faster,11-30 minutes,No,,"No, I didn't know that Stack Overflow had a jo...","No, and I don't know what those are","Yes, somewhat",Somewhat less welcome now than last year,Tech meetups or events in your area;Courses on...,25.0,Woman,No,Bisexual,White or of European descent,No,Appropriate in length,Easy
62332,I am a developer by profession,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time,Canada,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Received on-the-job training in software devel...,2-9 employees,"Developer, full-stack",16,12,7,Very satisfied,Very satisfied,Very confident,No,No,I am not interested in new job opportunities,1-2 years ago,Write any code;Complete a take-home project;In...,Yes,Office environment or company culture;Remote w...,"My job status changed (promotion, new job, etc.)",CAD,Canadian dollar,6116.0,Monthly,56028.0,8.0,There's no schedule or spec; I work on what se...,"Meetings;Non-work commitments (parenting, scho...",All or almost all the time (I'm full-time remote),Home,Far above average,"Yes, because I see value in code review",1.0,"Yes, it's part of our process",Developers typically have the most influence o...,I have a great deal of influence,Bash/Shell/PowerShell;C++;Elixir;Go;HTML/CSS;J...,Bash/Shell/PowerShell;JavaScript;Ruby;Rust;Web...,,,AWS;Docker;Heroku;Kubernetes;Linux;MacOS,Docker;Linux;MacOS,React.js;Ruby on Rails,,Ansible;Node.js,,Emacs,MacOS,Testing,Not at all,An irresponsible use of resources,No,Yes,Yes,YouTube,In real life (in person),Handle,I don't remember,A few times per week,Find answers to specific questions,1-2 times per week,Stack Overflow was much faster,0-10 minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...","No, not at all",A lot less welcome now than last year,,28.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy


In [5]:
# Random sample for 2018
survey_2018.sample(2)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,AssessJob1,AssessJob2,AssessJob3,AssessJob4,AssessJob5,AssessJob6,AssessJob7,AssessJob8,AssessJob9,AssessJob10,AssessBenefits1,AssessBenefits2,AssessBenefits3,AssessBenefits4,AssessBenefits5,AssessBenefits6,AssessBenefits7,AssessBenefits8,AssessBenefits9,AssessBenefits10,AssessBenefits11,JobContactPriorities1,JobContactPriorities2,JobContactPriorities3,JobContactPriorities4,JobContactPriorities5,JobEmailPriorities1,JobEmailPriorities2,JobEmailPriorities3,JobEmailPriorities4,JobEmailPriorities5,JobEmailPriorities6,JobEmailPriorities7,UpdateCV,Currency,Salary,SalaryType,ConvertedSalary,CurrencySymbol,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AdsPriorities1,AdsPriorities2,AdsPriorities3,AdsPriorities4,AdsPriorities5,AdsPriorities6,AdsPriorities7,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
57187,No,No,India,"Yes, part-time",Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele...","10,000 or more employees",Full-stack developer,3-5 years,3-5 years,Moderately dissatisfied,Moderately dissatisfied,Working in a different or more specialized tec...,"I’m not actively looking, but I am open to new...",More than 4 years ago,8.0,4.0,7.0,5.0,6.0,2.0,9.0,1.0,10.0,3.0,1.0,10.0,2.0,9.0,11.0,3.0,5.0,7.0,8.0,4.0,6.0,2.0,1.0,5.0,4.0,3.0,5.0,2.0,3.0,1.0,7.0,6.0,4.0,I had a negative experience or interaction at ...,Indian rupees (₹),30000,Monthly,5640.0,INR,"Office / productivity suite (Microsoft Office,...",One to three months,Taken an online course in programming or softw...,,,,Agree,Strongly agree,Strongly agree,C#;SQL,JavaScript;Python;HTML;CSS,SQL Server,MongoDB;Amazon DynamoDB;Microsoft Azure (Table...,Windows Desktop or Server,AWS;Azure;Linux,.NET Core,Angular;Node.js,Notepad++;Sublime Text;Visual Studio,Windows,1.0,Formal standard such as ISO 9001 or IEEE 12207...,Git;Team Foundation Version Control,Less than once per month,Yes,Yes,The website I was visiting forced me to disabl...,Somewhat agree,Neither agree nor disagree,Neither agree nor disagree,Clicked on an online advertisement;Saw an onli...,2.0,6.0,3.0,1.0,4.0,5.0,7.0,Increasing automation of jobs,,A governmental or other regulatory body,I'm excited about the possibilities more than ...,No,"Yes, but only within the company",Upper management at the company/organization,Yes,10 (Very Likely),A few times per week,Yes,I have never participated in Q&A on Stack Over...,"No, I knew that Stack Overflow had a jobs boar...","No, I know what it is but I don't have one",,Yes,Very interested,Somewhat interested,Extremely interested,Extremely interested,Extremely interested,Between 8:01 - 9:00 AM,Over 12 hours,1 - 2 hours,Never,Wrist/hand supports or braces,I don't typically exercise,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",South Asian,25 - 34 years old,No,,The survey was an appropriate length,Very easy
61617,Yes,Yes,France,No,Employed full-time,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",500 to 999 employees,Back-end developer;DevOps specialist,6-8 years,0-2 years,Slightly satisfied,Slightly satisfied,Working as a founder or co-founder of my own c...,"I’m not actively looking, but I am open to new...",Between 1 and 2 years ago,4.0,10.0,9.0,1.0,2.0,7.0,3.0,8.0,5.0,6.0,1.0,5.0,2.0,11.0,6.0,9.0,4.0,3.0,10.0,7.0,8.0,4.0,2.0,5.0,3.0,1.0,5.0,3.0,1.0,4.0,2.0,7.0,6.0,I had a negative experience or interaction at ...,Euros (€),31000,Yearly,37940.0,EUR,,One to three months,Participated in a hackathon,,,To improve my knowledge of a specific programm...,,,Disagree,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
# Random sample for 2017
survey_2017.sample(2)

Unnamed: 0,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,CompanyType,YearsProgram,YearsCodedJob,YearsCodedJobPast,DeveloperType,WebDeveloperType,MobileDeveloperType,NonDeveloperType,CareerSatisfaction,JobSatisfaction,ExCoderReturn,ExCoderNotForMe,ExCoderBalance,ExCoder10Years,ExCoderBelonged,ExCoderSkills,ExCoderWillNotCode,ExCoderActive,PronounceGIF,ProblemSolving,BuildingThings,LearningNewTech,BoringDetails,JobSecurity,DiversityImportant,AnnoyingUI,FriendsDevelopers,RightWrongWay,UnderstandComputers,SeriousWork,InvestTimeTools,WorkPayCare,KinshipDevelopers,ChallengeMyself,CompetePeers,ChangeWorld,JobSeekingStatus,HoursPerWeek,LastNewJob,AssessJobIndustry,AssessJobRole,AssessJobExp,AssessJobDept,AssessJobTech,AssessJobProjects,AssessJobCompensation,AssessJobOffice,AssessJobCommute,AssessJobRemote,AssessJobLeaders,AssessJobProfDevel,AssessJobDiversity,AssessJobProduct,AssessJobFinances,ImportantBenefits,ClickyKeys,JobProfile,ResumePrompted,LearnedHiring,ImportantHiringAlgorithms,ImportantHiringTechExp,ImportantHiringCommunication,ImportantHiringOpenSource,ImportantHiringPMExp,ImportantHiringCompanies,ImportantHiringTitles,ImportantHiringEducation,ImportantHiringRep,ImportantHiringGettingThingsDone,Currency,Overpaid,TabsSpaces,EducationImportant,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,CousinEducation,WorkStart,HaveWorkedLanguage,WantWorkLanguage,HaveWorkedFramework,WantWorkFramework,HaveWorkedDatabase,WantWorkDatabase,HaveWorkedPlatform,WantWorkPlatform,IDE,AuditoryEnvironment,Methodology,VersionControl,CheckInCode,ShipIt,OtherPeoplesCode,ProjectManagement,EnjoyDebugging,InTheZone,DifficultCommunication,CollaborateRemote,MetricAssess,EquipmentSatisfiedMonitors,EquipmentSatisfiedCPU,EquipmentSatisfiedRAM,EquipmentSatisfiedStorage,EquipmentSatisfiedRW,InfluenceInternet,InfluenceWorkstation,InfluenceHardware,InfluenceServers,InfluenceTechStack,InfluenceDeptTech,InfluenceVizTools,InfluenceDatabase,InfluenceCloud,InfluenceConsultants,InfluenceRecruitment,InfluenceCommunication,StackOverflowDescribes,StackOverflowSatisfaction,StackOverflowDevices,StackOverflowFoundAnswer,StackOverflowCopiedCode,StackOverflowJobListing,StackOverflowCompanyPage,StackOverflowJobSearch,StackOverflowNewQuestion,StackOverflowAnswer,StackOverflowMetaChat,StackOverflowAdsRelevant,StackOverflowAdsDistracting,StackOverflowModeration,StackOverflowCommunity,StackOverflowHelpful,StackOverflowBetter,StackOverflowWhatDo,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
32039,Professional developer,"Yes, both",India,No,Employed full-time,Master's degree,Computer engineering or electrical/electronics...,A few days each month,"10,000 or more employees",Publicly-traded corporation,6 to 7 years,3 to 4 years,,Embedded applications/devices developer; Machi...,,,,8.0,9.0,,,,,,,,,"With a soft ""g,"" like ""jiff""",Agree,Strongly agree,Somewhat agree,Disagree,Strongly disagree,Agree,Strongly agree,Somewhat agree,Agree,Disagree,Agree,Agree,Strongly disagree,Agree,Agree,Disagree,Agree,,,,,,,,,,,,,,,,,,,,Yes,LinkedIn,I was just giving it a regular update,"A friend, family member, or former colleague t...",Important,Important,Important,Important,Not very important,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Important,Indian rupees (?),,Both,Somewhat important,Online course; Self-taught; Coding competition...,Official documentation; Trade book; Textbook; ...,,Take online courses; Participate in online cod...,10:00 AM,C; C++; Python; R,C++; Python,,,MongoDB; Redis; MySQL,Redis; MySQL,Linux Desktop; Mac OS; Arduino; Raspberry Pi; ...,Linux Desktop; Mac OS; Amazon Web Services (AWS),Sublime Text; Vim; IPython / Jupyter,Turn on some music,,,,,,,,,,,,Satisfied,Very satisfied,Satisfied,Very satisfied,Satisfied,Somewhat satisfied,Some influence,No influence at all,Not much influence,No influence at all,Some influence,No influence at all,No influence at all,No influence at all,Not much influence,No influence at all,No influence at all,I have created a CV or Developer Story on Stac...,8.0,Desktop,At least once each week,At least once each week,Haven't done at all,Once or twice,Once or twice,Once or twice,Haven't done at all,Haven't done at all,Somewhat agree,Somewhat agree,Disagree,Somewhat agree,Agree,Agree,Agree,Somewhat agree,Male,A master's degree,,Somewhat agree,Somewhat agree,Strongly disagree,Strongly agree,,
21401,Professional developer,"Yes, I contribute to open source projects",Germany,No,Employed full-time,Some college/university study without earning ...,Computer science or software engineering,A few days each month,20 to 99 employees,"Privately-held limited company, not in startup...",6 to 7 years,3 to 4 years,,Web developer; Machine learning specialist,,,,7.0,5.0,,,,,,,,,"With a soft ""g,"" like ""jiff""",Agree,Somewhat agree,Agree,Somewhat agree,Agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,Disagree,Somewhat agree,Somewhat agree,Somewhat agree,Somewhat agree,,,,,,,,,,,,,,,,,,,,No,,,,,,,,,,,,,,Euros (€),,Tabs,Somewhat important,Online course; Industry certification; On-the-...,Official documentation; Trade book; Textbook; ...,,Take online courses; Contribute to open source...,11:00 AM,Java; JavaScript; Python; Ruby; SQL,Go; JavaScript; Python; Ruby; SQL,AngularJS,AngularJS; React,MySQL; PostgreSQL,Redis; PostgreSQL,Mac OS; Amazon Web Services (AWS),Mac OS; Amazon Web Services (AWS),Sublime Text; IPython / Jupyter,Turn on some music,Waterfall; Agile; Scrum; Pair,Git,A few times a week,Somewhat agree,Somewhat agree,Somewhat agree,Disagree,,Somewhat agree,Disagree,Hours worked; Commit frequency; Release freque...,Not very satisfied,Somewhat satisfied,Somewhat satisfied,Somewhat satisfied,Somewhat satisfied,Not very satisfied,A lot of influence,Not much influence,Some influence,Some influence,Some influence,Some influence,A lot of influence,Some influence,Some influence,Not much influence,Some influence,"I have a login for Stack Overflow, but haven't...",7.0,Desktop,Several times,Once or twice,Once or twice,Once or twice,Haven't done at all,Once or twice,Once or twice,Haven't done at all,Somewhat agree,Disagree,Disagree,Somewhat agree,Agree,Somewhat agree,Disagree,Disagree,Male,"Some college/university study, no bachelor's d...",Middle Eastern,Agree,Somewhat agree,Disagree,Agree,,


In [7]:
# Random sample for 206
survey_2016.sample(2)

Unnamed: 0,collector,country,un_subregion,so_region,age_range,age_midpoint,gender,self_identification,occupation,occupation_group,experience_range,experience_midpoint,salary_range,salary_midpoint,big_mac_index,tech_do,tech_want,aliens,programming_ability,employment_status,industry,company_size_range,team_size_range,women_on_team,remote,job_satisfaction,job_discovery,dev_environment,commit_frequency,hobby,dogs_vs_cats,desktop_os,unit_testing,rep_range,visit_frequency,why_learn_new_tech,education,open_to_new_job,new_job_value,job_search_annoyance,interview_likelihood,how_to_improve_interview_process,star_wars_vs_star_trek,agree_tech,agree_notice,agree_problemsolving,agree_diversity,agree_adblocker,agree_alcohol,agree_loveboss,agree_nightcode,agree_legacy,agree_mars,important_variety,important_control,important_sameend,important_newtech,important_buildnew,important_buildexisting,important_promotion,important_companymission,important_wfh,important_ownoffice,developer_challenges,why_stack_overflow
25553,Meta Stack Overflow Post,Germany,Western Europe,Western Europe,25-29,27.0,Male,,,,,,,,3.86,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
46726,Miscellaneous,United States,North America,North America,30-34,32.0,Male,Developer; Programmer,Full-stack web developer,Full-stack web developer,6 - 10 years,8.0,Rather not say,,4.93,iOS; JavaScript; PHP; Ruby; SQL; WordPress,,No,7.0,Employed full-time,Web Services,20-99 employees,5-9 people,0.0,I rarely work remotely,I'm somewhat satisfied with my job,A friend referred me,Coda; Xcode,Multiple times a day,2-5 hours per week,Dogs,Mac OS X,Yes,I don't have an account,Once a week,I want to be a better developer,I'm self-taught; On-the-job training,"I'm not actively looking, but I am open to new...",Salary; Health insurance; Industry; Company si...,The interview process,50%,Offer remote interviews (e.g. via video confer...,Star Wars,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,Agree somewhat,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,Poor scheduling; Unrealistic expectations; Cor...,To give help to others; To communicate with ot...


Now we have a better sense of what the data looks like, so we can proceed to picking the columns that we will need for the analysis, given the questions we outlines in [Section 1.](#business). These are the following:

The columns that indicate what language respondants want to learn are the following:
> Survey 2020: LanguageWorkedWith \
> Survey 2019: LanguageWorkedWith \
> Survey 2018: LanguageWorkedWith \
> Survey 2017: HaveWorkedLanguage \
> Survey 2016: tech_do

The columns that indicate how satisfied a respondant is with their job are:
> Survey 2020: JobSat \
> Survey 2019: JobSat \
> Survey 2018: JobSatisfaction \
> Survey 2017: JobSatisfaction \
> Survey 2016: job_satisfaction

The columns that indicate education status are the following:
> Survey 2020: EdLevel \
> Survey 2019: EdLevel \
> Survey 2018: FormalEducation \
> Survey 2017: FormalEducation \
> Survey 2016: education

The columns that indicate where the respondant lives are:
> Survey 2020: Country \
> Survey 2019: Country \
> Survey 2018: Country \
> Survey 2017: Country \
> Survey 2016: country

The columns that indicate what is the respondant's developer status are:
> Survey 2020: Gender \
> Survey 2019: Gender \
> Survey 2018: Gender \
> Survey 2017: Gender \
> Survey 2016: gender

The columns that indicate what is the respondant's employment status are:
> Survey 2020: Employment \
> Survey 2019: Employment \
> Survey 2018: Employment \
> Survey 2017: EmploymentStatus \
> Survey 2016: employment_status


Lastly, we can take a loop at the shapefiles we imported:

In [8]:
# Head of shapefiles
map_df.head()

Unnamed: 0,OBJECTID,CNTRY_NAME,CNTRY_CODE,BPL_CODE,geometry
0,1,Algeria,12,13010.0,"MULTIPOLYGON (((-2.05592 35.07370, -2.05675 35..."
1,2,Angola,24,12010.0,"MULTIPOLYGON (((12.79760 -4.41685, 12.79875 -4..."
2,3,In dispute South Sudan/Sudan,9999,99999.0,"POLYGON ((28.08408 9.34722, 28.03889 9.34722, ..."
3,4,Benin,204,15010.0,"MULTIPOLYGON (((1.93753 6.30122, 1.93422 6.299..."
4,5,Botswana,72,14010.0,"POLYGON ((25.16312 -17.77816, 25.16383 -17.778..."


So the countries are in the variable CNTRY_NAME. We will need to match these with our data and for that we will have to harmonize all the country names in [Section 3.](#prepare)

Given all of the above we can proceed to prepare our data!

<a name="prepare"></a>
## 3. Prepare data

Since we already know which columns we will need in order to answer our questions we can first start by droping columns that are not relevant to us:

In [9]:
# Put relevant variables in list
keep_2020 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2019 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2018 = ['LanguageWorkedWith', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'Employment']
keep_2017 = ['HaveWorkedLanguage', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'EmploymentStatus']
keep_2016 = ['tech_do', 'job_satisfaction', 'education', 'country', 'gender', 'employment_status']

# Keep only relevant variables
survey_2020 = survey_2020[keep_2020]
survey_2019 = survey_2019[keep_2019]
survey_2018 = survey_2018[keep_2018]
survey_2017 = survey_2017[keep_2017]
survey_2016 = survey_2016[keep_2016]

Nice! Now we can rename the columns so that all dataframes have the same names for variables

In [10]:
# Rename columns
survey_2020.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2019.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2018.rename(columns={'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2017.rename(columns={'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'EmploymentStatus': 'employment'}, inplace = True)
survey_2016.rename(columns={'tech_do': 'languages', 'job_satisfaction': 'job_satisfaction', 
                           'education': 'education', 'country': 'country', 'gender': 'gender', 
                           'employment_status': 'employment'}, inplace = True);

Now we need to harmonize the answers to different questions for all survey years in order to merge them and have a complete data set. Let's with some an easy one and look at the category for gender in each year.

In [11]:
# Print unique gender categories in 2020
survey_2020['gender'].unique()

array(['Man', nan, 'Woman',
       'Man;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man'], dtype=object)

In [12]:
# Print unique gender categories in 2019
survey_2019['gender'].unique()

array(['Man', nan, 'Woman',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man',
       'Man;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [13]:
# Print unique gender categories in 2018
survey_2018['gender'].unique()

array(['Male', nan, 'Female',
       'Female;Male;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Male',
       'Male;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming', 'Transgender',
       'Female;Transgender',
       'Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Non-binary, genderqueer, or gender non-conforming',
       'Female;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender', 'Female;Male;Transgender',
       'Female;Male;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [14]:
# Print unique gender categories in 2017
survey_2017['gender'].unique()

array(['Male', nan, 'Female', 'Gender non-conforming', 'Other',
       'Male; Gender non-conforming', 'Female; Transgender',
       'Male; Female', 'Male; Other', 'Transgender',
       'Transgender; Gender non-conforming',
       'Female; Gender non-conforming',
       'Male; Female; Transgender; Gender non-conforming; Other',
       'Male; Female; Transgender', 'Male; Female; Other',
       'Male; Female; Transgender; Gender non-conforming',
       'Male; Transgender', 'Female; Transgender; Gender non-conforming',
       'Gender non-conforming; Other',
       'Male; Female; Gender non-conforming', 'Female; Other',
       'Male; Transgender; Gender non-conforming', 'Transgender; Other',
       'Male; Gender non-conforming; Other',
       'Female; Gender non-conforming; Other',
       'Male; Female; Gender non-conforming; Other',
       'Female; Transgender; Other',
       'Female; Transgender; Gender non-conforming; Other',
       'Male; Transgender; Other', 'Male; Female; Transgender;

In [15]:
# Print unique gender categories in 2016
survey_2016['gender'].unique()

array(['Male', nan, 'Female', 'Prefer not to disclose', 'Other'],
      dtype=object)

Given what we see above, let's cluster all in the following four categories: female, male, other, nan. We can define a function to assing the value of Male, Female, Other or nan.

In [18]:
# Define function to harmonize gender
def harmonize_gender(df_raw):
    '''This function unifies all gender categories into 
    four: Male, Female, Other and nan. It also creates 
    binary variables for each of the above categories.
    '''
    # Copy df_raw
    df = df_raw.copy()
    # Binary variable for categories
    df['gender_male'] = 0
    df['gender_female'] = 0
    df['gender_other'] = 0
    df['gender_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define gender
        gender = str(df.loc[i, 'gender']).lower()
        # Value if male or man
        if gender == 'male' or gender == 'man':
            df.loc[i, 'gender'] = 'Male'
            df.loc[i, 'gender_male'] = 1
        # Value if female or woman
        elif gender == 'female' or gender == 'woman':
            df.loc[i, 'gender'] = 'Female'
            df.loc[i, 'gender_female'] = 1
        # Assign null values
        elif gender == 'nan':
            df.loc[i, 'gender'] = np.nan
            df.loc[i, 'gender_null'] = 1
        # Other categories lumped into other
        else:
            df.loc[i, 'gender'] = 'Other'
            df.loc[i, 'gender_other'] = 1
    # Return harmonized dataframe
    return(df)

# Apply gender harmonizer
survey_2020 = harmonize_gender(survey_2020)
survey_2019 = harmonize_gender(survey_2019)
survey_2018 = harmonize_gender(survey_2018)
survey_2017 = harmonize_gender(survey_2017)
survey_2016 = harmonize_gender(survey_2016)

HBox(children=(FloatProgress(value=0.0, max=64461.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=88883.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=98855.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=51392.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=56030.0), HTML(value='')))




In [19]:
survey_2020.head()

Unnamed: 0,languages,job_satisfaction,education,country,gender,employment,gender_male,gender_female,gender_other,gender_null
0,C#;HTML/CSS;JavaScript,Slightly satisfied,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Germany,Male,"Independent contractor, freelancer, or self-em...",1,0,0,0
1,JavaScript;Swift,Very dissatisfied,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",United Kingdom,,Employed full-time,0,0,0,1
2,Objective-C;Python;Swift,,,Russian Federation,,,0,0,0,1
3,,Slightly dissatisfied,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Albania,Male,,1,0,0,0
4,HTML/CSS;Ruby;SQL,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",United States,Male,Employed full-time,1,0,0,0


Similarly, for job satisfaction we can look at the possible values

In [20]:
# Print unique job satisfaction categories in 2020
survey_2020['job_satisfaction'].unique()

array(['Slightly satisfied', 'Very dissatisfied', nan,
       'Slightly dissatisfied', 'Very satisfied',
       'Neither satisfied nor dissatisfied'], dtype=object)

In [21]:
# Print unique job satisfaction categories in 2019
survey_2019['job_satisfaction'].unique()

array([nan, 'Slightly satisfied', 'Slightly dissatisfied',
       'Neither satisfied nor dissatisfied', 'Very satisfied',
       'Very dissatisfied'], dtype=object)

In [22]:
# Print unique job satisfaction categories in 2018
survey_2018['job_satisfaction'].unique()

array(['Extremely satisfied', 'Moderately dissatisfied',
       'Moderately satisfied', 'Neither satisfied nor dissatisfied',
       'Slightly satisfied', nan, 'Slightly dissatisfied',
       'Extremely dissatisfied'], dtype=object)

In [23]:
# Print unique job satisfaction categories in 2017
survey_2017['job_satisfaction'].unique()

array([nan,  9.,  3.,  8.,  6.,  7.,  5.,  4., 10.,  2.,  0.,  1.])

In [24]:
# Print unique job satisfaction categories in 2016
survey_2016['job_satisfaction'].unique()

array([nan, 'I love my job', "I don't have a job",
       "I'm somewhat satisfied with my job",
       "I'm somewhat dissatisfied with my job",
       "I'm neither satisfied nor dissatisfied", 'Other (please specify)',
       'I hate my job'], dtype=object)

We will try to lump all categories into six categories: Very satisfied, satisfied, Neither, Dissatisfied, Very Dissatisfied and nan. We will take a similar approach to gender and define a function to do this.

In [25]:
# Define function to harmonize job satisfaction
def harmonize_jobsatisfaction(df_raw):
    '''This function harmonizes all the job
    satisfaction responses into: Very satisfied,
    Satisfied, Neither, Dissatisfied, Very dissatisfied
    and nan. It also creates binary variables for each
    of the above categories.'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    v_satisfied = ['very satisfied', 'extremely satisfied', 'i love my job', '10', '9']
    satisfied = ['slightly satisfied', 'moderately satisfied', 'i\'m somewhat satisfied with my job', '8', '7']
    neither = ['neither satisfied nor dissatisfied', 'i\'m neither satisfied nor dissatisfied', '6', '5', '4']
    dissatisfied = ['slightly dissatisfied', 'moderately dissatisfied', 'i\'m somewhat dissatisfied with my job', '3', '2']
    v_dissatisfied = ['very dissatisfied', 'extremely dissatisfied', 'i hate my job', '1', '0']
    # New binary variables
    df['jobsat_v_satisfied'] = 0
    df['jobsat_satisfied'] = 0
    df['jobsat_neither'] = 0
    df['jobsat_disssatisfied'] = 0
    df['jobsat_v_disssatisfied'] = 0
    df['jobsat_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define job satisfaction
        job_satisfac = str(df.loc[i, 'job_satisfaction']).lower()
        # Value if very satisfied and assign binary variable
        if job_satisfac in v_satisfied:
            df.loc[i, 'job_satisfaction'] = 'Very satisfied'
            df.loc[i,'jobsat_v_satisfied'] = 1
        # Value if satisfied and assign binary variable
        elif job_satisfac in satisfied:
            df.loc[i, 'job_satisfaction'] = 'Satisfied'
            df.loc[i, 'jobsat_satisfied'] = 1
        # Value if neither and assign binary variable
        elif job_satisfac in neither:
            df.loc[i, 'job_satisfaction'] = 'Neither'
            df.loc[i, 'jobsat_neither'] = 1
        # Value if dissatisfied and assign binary variable
        elif job_satisfac in dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Dissatisfied'
            df.loc[i, 'jobsat_dissatisfied'] = 1
        # Value if very dissatisfied and assign binary variable
        elif job_satisfac in v_dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Very Dissatisfied'
            df.loc[i, 'jobsat_v_dissatisfied'] = 1
        # Other categories become np.nan values
        else:
            df.loc[i, 'job_satisfaction'] = np.nan
            df.loc[i, 'jobsat_null'] = 1
    # Return harmonized dataframe
    return(df)
    
    
# Apply job satisfaction harmonizer
survey_2020 = harmonize_jobsatisfaction(survey_2020)
survey_2019 = harmonize_jobsatisfaction(survey_2019)
survey_2018 = harmonize_jobsatisfaction(survey_2018)
survey_2017 = harmonize_jobsatisfaction(survey_2017)
survey_2016 = harmonize_jobsatisfaction(survey_2016)

HBox(children=(FloatProgress(value=0.0, max=64461.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=88883.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=98855.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=51392.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=56030.0), HTML(value='')))




Next, let's look at the employment variables and how they are layed out

In [26]:
survey_2020 = harmonize_jobsatisfaction(survey_2020)

HBox(children=(FloatProgress(value=0.0, max=64461.0), HTML(value='')))




In [27]:
pd.options.display.max_columns = None
survey_2020.head()

Unnamed: 0,languages,job_satisfaction,education,country,gender,employment,gender_male,gender_female,gender_other,gender_null,jobsat_v_satisfied,jobsat_satisfied,jobsat_neither,jobsat_disssatisfied,jobsat_v_disssatisfied,jobsat_null,jobsat_v_dissatisfied,jobsat_dissatisfied
0,C#;HTML/CSS;JavaScript,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Germany,Male,"Independent contractor, freelancer, or self-em...",1,0,0,0,1,0,0,0,0,1,1,1
1,JavaScript;Swift,Very Dissatisfied,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",United Kingdom,,Employed full-time,0,0,0,1,1,0,0,0,0,1,1,1
2,Objective-C;Python;Swift,,,Russian Federation,,,0,0,0,1,1,0,0,0,0,1,1,1
3,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Albania,Male,,1,0,0,0,1,0,0,0,0,1,1,1
4,HTML/CSS;Ruby;SQL,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",United States,Male,Employed full-time,1,0,0,0,1,0,0,0,0,1,1,1


In [28]:
# Print unique employment categories in 2020
survey_2020['employment'].unique()

array(['Independent contractor, freelancer, or self-employed',
       'Employed full-time', nan, 'Student',
       'Not employed, but looking for work', 'Employed part-time',
       'Retired', 'Not employed, and not looking for work'], dtype=object)

In [29]:
# Print unique employment categories in 2019
survey_2019['employment'].unique()

array(['Not employed, and not looking for work',
       'Not employed, but looking for work', 'Employed full-time',
       'Independent contractor, freelancer, or self-employed', nan,
       'Employed part-time', 'Retired'], dtype=object)

In [30]:
# Print unique employment categories in 2018
survey_2018['employment'].unique()

array(['Employed part-time', 'Employed full-time',
       'Independent contractor, freelancer, or self-employed',
       'Not employed, and not looking for work',
       'Not employed, but looking for work', nan, 'Retired'], dtype=object)

In [31]:
# Print unique employment categories in 2017
survey_2017['employment'].unique()

array(['Not employed, and not looking for work', 'Employed part-time',
       'Employed full-time',
       'Independent contractor, freelancer, or self-employed',
       'Not employed, but looking for work', 'I prefer not to say',
       'Retired'], dtype=object)

In [32]:
# Print unique employment categories in 2016
survey_2016['employment'].unique()

array([nan, 'Employed full-time', 'Freelance / Contractor',
       'Self-employed', "I'm a student", 'Unemployed',
       'Prefer not to disclose', 'Employed part-time',
       'Other (please specify)', 'Retired'], dtype=object)

Now we have to deal with the responses for employment status. This one seems a bit trickier as answer categories have changed over the years. With that in mind, let's create a function to harmonize these categories into the folowing: Full-time, Part-time, Self-employed, Not emplyed, Other and nan.

In [None]:
# Define function to harmonize employment categories
def harmonize_employment(df_raw):
    '''This function harmonizes all employment responses
    into: Full-time, Part-time, Self-employed, Not employed,
    Other and nan. It also creates binary variables for each
    of the above categories.'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    full_time = ['employed full-time']
    part_time = ['employed part-time']
    self_empl = ['independent contractor, freelancer, or self-employed', 'freelance / contractor', 'self-employed']
    not_employed = ['not employed, but looking for work', 'not employed, and not looking for work', 'unemployed']
    other = ['student', 'i\'m a student', 'retired', 'i prefer not to say', 'prefer not to disclose', 'other (please specify)']
    # New binary variables
    df['employment_full_time'] = 0
    df['employment_part_time'] = 0
    df['employment_self_empl'] = 0
    df['employment_not_empl'] = 0
    df['employment_other'] = 0
    df['employment_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define employment
        employment = str(df.loc[i, 'employment']).lower()
        # Value if full-time and assign binary variable
        if employment in full_time:
            df.loc[i, 'employment'] = 'Full-time'
            df.loc[i, 'employment_full_time'] = 1
        # Value if part-time and assign binary variable
        elif employment in part_time:
            df.loc[i, 'employment'] = 'Part-time'
            df.loc[i, 'employment_part_time'] = 1
        # Value if self-employed and assign binary variable
        elif employment in self_empl:
            df.loc[i, 'employment'] = 'Self-employed'
            df.loc[i, 'employment_self_empl'] = 1
        # Value if not employed and assign binary variable
        elif employment in not_employed:
            df.loc[i, 'employment'] = 'Not employed'
            df.loc[i, 'employment_not_empl'] = 1
        # Value if other and assign binary variable
        elif employment in other:
            df.loc[i, 'employment'] = 'Other'
            df.loc[i, 'employment_other'] = 1
        # Other categories become np.nan values
        else:
            df.loc[i, 'employment'] = np.nan
            df.loc[i, 'employment_null'] = 1
    # Return harmonized dataframe
    return(df)

# Apply employment harmonizer
survey_2020 = harmonize_employment(survey_2020)
survey_2019 = harmonize_employment(survey_2019)
survey_2018 = harmonize_employment(survey_2018)
survey_2017 = harmonize_employment(survey_2017)
survey_2016 = harmonize_employment(survey_2016)

HBox(children=(FloatProgress(value=0.0, max=64461.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=88883.0), HTML(value='')))

Now let's look at education variables and their respective values.

In [None]:
# Print unique education categories in 2020
survey_2020['education'].unique()

In [None]:
# Print unique education categories in 2019
survey_2019['education'].unique()

In [None]:
# Print unique education categories in 2018
survey_2018['education'].unique()

In [None]:
# Print unique education categories in 2017
survey_2017['education'].unique()

In [None]:
# Print unique education categories in 2016
survey_2016['education'].unique()

The answers for 2016 look very different from the previous years. This is probably because respondants were allowed to tick more than one box.

Lastly, in order to be able to match the survey data we have with the geodata we imported we need to harminze the name of the countries. In order to do that, we can use the [country_converter library](https://pypi.org/project/country-converter/). We define the following function and apply it to the country values

In [None]:
# Define function to convert country name into ISO3
def country_iso3(df_raw, df_type = 'survey'):
    '''This function createts ISO3 country values 
    column'''
    # Check if df_type valid
    if df_type != 'survey' or df_survey != 'map':
        ## Exception
    # Copy df_raw
    df = df_raw.copy()
    # If survey is passed
    if df_type = 'survey':
        # Loop over row values
        for i in tqdm(df.index):
            # Define country value
            country = str(df.loc[i, 'country'])
            # Create to ISO3
            df.loc[i, 'iso3'] = coco.convert(names = country, 'ISO3')
    # If map is passed
    elif df_type = 'map':
        # Loop over row values
        for i in tqdm(df.index):
            # Define country value
            country = str(df.loc[i, 'CNTRY_NAME'])
            # Convert to ISO3
            df.loc[i, 'iso3'] = coco.convert(names = country, 'ISO3')
    # Return dataframe
    return(df)

# Convert ISO3 
        


With that in hands we can add a variable at the end of each dataset to mark the year it represents and merge them

In [None]:
# Add year variable to dataframes
survey_2020['year'] = 2020
survey_2019['year'] = 2019
survey_2018['year'] = 2018
survey_2017['year'] = 2017
survey_2016['year'] = 2016

# Merge datasets into one
data = [survey_2020, survey_2019, survey_2018, survey_2017, survey_2016]
survey = pd.concat(data)

Another thing that we need to prepare in order to later match the surveys and the shapefile datasets is the country names. In order to have exact matches between the country names in the survey and shapefile data we can convert all the names using a regex to an ISO3 name and then match them. Luckily, the packate country-converter does that for us.

In [None]:
# List of survey files
dfs = [survey_2020, survey_2019, survey_2018, survey_2017, survey_2016]



<a name="model"></a>
## 4. Data modeling

Text text

<a name="eval"></a>
## 5. Evaluate the results

Text text

<a name="deploy"></a>
## 6. Deploy

Text text

In [None]:
os.getcwd()

In [None]:
import geopandas as gpd

In [None]:
map_df = gpd.read_file('IPUMSI_world_release2017/world_countries_2017.shp')

In [None]:
ax = map_df.plot()
ax.axis('off');

In [None]:
map_df.head()

In [None]:
countries = map_df['CNTRY_NAME'].unique().tolist()

In [None]:
'Bahamas' in countries

In [None]:
countries2 = survey_2020['Country'].unique().tolist()

In [None]:
import country_converter as coco

iso_lst1 = []
iso_lst2 = []

for country in countries:
    iso1 = coco.convert(names=country, to='ISO3')
    iso_lst1.append(iso1)

for country2 in countries2:
    iso2 = coco.convert(names=country2, to='ISO3')
    iso_lst2.append(iso2)


In [None]:
for i in iso_lst2:
    print(i, i in iso_lst1)

In [None]:
'US' in iso_lst1

In [None]:
coco.convert(names='United States of America', to='ISO3')

In [None]:
'USA' in iso_lst2

In [None]:
for i in survey_2020.index:
    survey_2020.loc[i, 'Country'] = coco.convert(names = str(survey_2020.loc[i, 'Country']), to = 'ISO3')

In [None]:
survey_2020.head()