# Title here

Description here

## Table of contents
- [1. Business understanding](#business)
- [2. Data understanding](#data)
    - [2.1. Gathering data](#gather)
    - [2.2. Assessing data](#assess)
- [3. Prepare data](#prepare)
- [4. Data modeling](#model)
- [5. Evaluate the results](#eval)
- [6. Deploy](#deploy)

<a name="business"></a>
## 1. Business understanding

In this notebook we will try to address the following questions using data from [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey).

> What are the most popular programming languages over the past five years?\
> What countries have more female respondents?\
> Are male respondents happier than female respondents?\
> What countries have the highest job satisfaction rates?

The following sections walk through understanding the data and what needs to be done to answer the questions above.

<a name="data"></a>
## 2. Data understanding

We begin by downloading the data needed to address the questions laid out in [Section 1](#business). We then take a first look at the data to get a sense of what needs to be changed later on.

<a name="gather"></a>
### 2.1. Gathering data

First, we need to download all the necessary data. Running the cell below downloads the Stack Overflow surveys for all years, as well as the shapefiles:

In [None]:
# Download survey data
%run -i '../download/download.py'

# Download shape files
%run -i '../download/shape.py'

These are all the surveys since 2011. We will only use the ones from the last five years (2016 to 2020). One reason is that the structure of the survey changed over time, so similar questions might no longer be comparable. Next, in preparation for the following sections, we import the relevant libraries.

In [43]:
# Import libraries
import geopandas as gpd
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandarallel import pandarallel
from pycountry import countries
from tqdm.auto import tqdm
%matplotlib inline

<a name="assess"></a>
### 2.2. Assessing data

Now that we have downloaded all the datasets, let's start by reading the CSV files from the past five years. In doing so, I am ignoring the first column, as it only serves as an ordered identifier for the respondents.

In [2]:
# Import survey data and skip first column
import warnings; warnings.simplefilter('ignore')
survey_2016 = pd.read_csv("../data/survey/survey_2016.csv").iloc[:, 1:]
survey_2017 = pd.read_csv("../data/survey/survey_2017.csv").iloc[:, 1:]
survey_2018 = pd.read_csv("../data/survey/survey_2018.csv").iloc[:, 1:]
survey_2019 = pd.read_csv("../data/survey/survey_2019.csv").iloc[:, 1:]
survey_2020 = pd.read_csv("../data/survey/survey_2020.csv").iloc[:, 1:]

# Import shapefile with geopandas
map_df = gpd.read_file("../data/shapefile/world_countries_2017.shp")
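
The five near-identical `read_csv` calls can also be written as a loop. The following is a minimal sketch rather than the notebook's own code: the helper name `load_surveys` and its `path_template` argument are hypothetical, and it assumes every file follows the `survey_<year>.csv` naming used above.

```python
import pandas as pd


def load_surveys(years, path_template="../data/survey/survey_{year}.csv"):
    """Read one CSV per year and drop the first (row-identifier) column.

    `load_surveys` and `path_template` are illustrative names, not part of
    the original notebook.
    """
    surveys = {}
    for year in years:
        df = pd.read_csv(path_template.format(year=year))
        surveys[year] = df.iloc[:, 1:]  # skip the leading identifier column
    return surveys


# Hypothetical usage: surveys = load_surveys(range(2016, 2021))
```

Keeping the DataFrames in a dictionary keyed by year makes it easier to apply the same preparation step to every survey later on.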

Great! Now we can quickly look at what these datasets look like. I will do that by picking two random samples from the survey.

In [4]:
# Show dataframe for two random samples for 2020
pd.options.display.max_columns = None # to show all columns
survey_2020.sample(2)

Unnamed: 0,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,CurrencySymbol,DatabaseDesireNextYear,DatabaseWorkedWith,DevType,EdLevel,Employment,Ethnicity,Gender,JobFactors,JobSat,JobSeek,LanguageDesireNextYear,LanguageWorkedWith,MiscTechDesireNextYear,MiscTechWorkedWith,NEWCollabToolsDesireNextYear,NEWCollabToolsWorkedWith,NEWDevOps,NEWDevOpsImpt,NEWEdImpt,NEWJobHunt,NEWJobHuntResearch,NEWLearn,NEWOffTopic,NEWOnboardGood,NEWOtherComms,NEWOvertime,NEWPurchaseResearch,NEWPurpleLink,NEWSOSites,NEWStuck,OpSys,OrgSize,PlatformDesireNextYear,PlatformWorkedWith,PurchaseWhat,Sexuality,SOAccount,SOComm,SOPartFreq,SOVisitFreq,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
3631,I am a developer by profession,Yes,27.0,15,Yearly,54000.0,69800.0,United Kingdom,Pound sterling,GBP,PostgreSQL,PostgreSQL,"Developer, full-stack","Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Employed full-time,White or of European descent,Man,Specific department or team I’d be working on;...,Very satisfied,I am not interested in new job opportunities,Python;SQL;TypeScript,JavaScript;Python;SQL;TypeScript,Teraform,Ansible;Node.js;Teraform,"Slack;Google Suite (Docs, Meet, etc)","Github;Slack;Trello;Google Suite (Docs, Meet, ...",No,Somewhat important,Fairly important,Better compensation;Looking to relocate,"Read company media, such as employee blogs or ...",Every few months,Yes,Yes,Yes,Occasionally: 1-2 days per quarter but less th...,Start a free trial;Ask developers I know/work ...,Amused,Stack Overflow (public Q&A for anyone who code...,Call a coworker or friend;Visit Stack Overflow...,MacOS,10 to 19 employees,AWS;Docker;Linux;MacOS,AWS;Docker;Linux;MacOS,I have some influence,Straight / Heterosexual,Yes,"Yes, somewhat",Less than once per month or monthly,Daily or almost daily,Neither easy nor difficult,Too long,No,"Computer science, computer engineering, or sof...",Flask;React.js,Angular.js;Flask;jQuery;React.js,Just as welcome now as I felt last year,40.0,12,4.0
22065,I am a student who is learning to code,Yes,20.0,15,,,,Netherlands,,,MongoDB;SQLite,Firebase;MongoDB;MySQL;SQLite,,Some college/university study without earning ...,Student,White or of European descent,Man,"Flex time or a flexible schedule;Languages, fr...",,"I’m not actively looking, but I am open to new...",C#;C++;HTML/CSS;JavaScript;TypeScript,C;C#;C++;HTML/CSS;Java;JavaScript;PHP;Swift;Ty...,Node.js,Node.js,"Jira;Google Suite (Docs, Meet, etc)","Github;Google Suite (Docs, Meet, etc)",,,,,,Every few months,Not sure,,No,,Start a free trial;Ask developers I know/work ...,"Hello, old friend",Stack Overflow (public Q&A for anyone who code...,Visit Stack Overflow;Go for a walk or other ph...,Windows,,Android;iOS,Android;iOS;Microsoft Azure;Windows,,Straight / Heterosexual,Yes,"No, not at all",I have never participated in Q&A on Stack Over...,Multiple times per day,Easy,Appropriate in length,No,"Computer science, computer engineering, or sof...",Angular;Express,Angular;Express,Just as welcome now as I felt last year,,4,


And for the remaining years we see:

In [5]:
# Random sample for 2019
survey_2019.sample(2)

Unnamed: 0,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
27834,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Italy,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",Participated in a full-time developer training...,20 to 99 employees,"Developer, full-stack",10,16,4,Very satisfied,Slightly dissatisfied,Somewhat confident,Yes,No,I am actively looking for a job,Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,"Languages, frameworks, and other technologies ...","Something else changed (education, award, medi...",EUR,European Euro,1350.0,Monthly,18564.0,40.0,There's no schedule or spec; I work on what se...,Lack of support from management;Toxic work env...,Less than once per month / Never,Office,A little above average,"Yes, because I was told to do so",8.0,"No, but I think we should",Developers and management have nearly equal in...,I have some influence,C#;HTML/CSS;JavaScript;SQL;TypeScript,C#;JavaScript;SQL;TypeScript,DynamoDB;Elasticsearch;MongoDB;MySQL;PostgreSQL,DynamoDB;Elasticsearch;MongoDB;MySQL;Oracle;Po...,AWS;Microsoft Azure;Slack,AWS;Slack,Angular/Angular.js;ASP.NET,Angular/Angular.js;ASP.NET,.NET;.NET Core;Node.js,.NET;.NET Core;Node.js,Notepad++;PHPStorm;Visual Studio;Visual Studio...,Windows,Development;Testing;Production,Non-currency applications of blockchain,Useful across many domains and could change ma...,No,Also Yes,Yes,WhatsApp,In real life (in person),Username,2014.0,Daily or almost daily,Find answers to specific questions,6-10 times per week,Stack Overflow was much faster,31-60 minutes,No,,Yes,"No, and I don't know what those are","No, not really",Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,32.0,Woman,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
54329,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,Israel,No,Some college/university study without earning ...,"Another engineering discipline (ex. civil, ele...",Taken an online course in programming or softw...,10 to 19 employees,"Developer, back-end;Developer, full-stack;Deve...",36,10,22,Slightly satisfied,Very satisfied,Somewhat confident,No,No,"I’m not actively looking, but I am open to new...",Less than a year ago,"Write any code;Write code by hand (e.g., on a ...",No,Industry that I'd be working in;Specific depar...,I was preparing for a job search,ILS,Israeli new shekel,21000.0,Monthly,69276.0,36.0,There is a schedule and/or spec (made by me or...,Distracting work environment;Meetings;Toxic wo...,"Less than half the time, but at least one day ...",Office,Far above average,No,,"No, but I think we should",Not sure,I have some influence,Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...,C;C++;Dart;Java;Kotlin;Python;Rust;TypeScript,MongoDB;MySQL,MongoDB;MySQL;PostgreSQL,Android;AWS;Docker;Linux,Android;Arduino;Docker;Linux,Flask;jQuery;React.js,Flask;React.js;Vue.js,React Native,Chef;Flutter;Puppet,Android Studio;Eclipse;Notepad++;PyCharm;Vim,Windows,Development;Testing;Production,Not at all,A passing fad,No,Yes,No,WhatsApp,Online,Username,,Multiple times per day,Find answers to specific questions,3-5 times per week,Stack Overflow was much faster,60+ minutes,Yes,Less than once per month or monthly,Yes,"No, I've heard of them, but I am not part of a...","Yes, somewhat",Just as welcome now as I felt last year,,46.0,Man,No,Straight / Heterosexual,Middle Eastern;White or of European descent,Yes,Too long,Easy


In [6]:
# Random sample for 2018
survey_2018.sample(2)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,AssessJob1,AssessJob2,AssessJob3,AssessJob4,AssessJob5,AssessJob6,AssessJob7,AssessJob8,AssessJob9,AssessJob10,AssessBenefits1,AssessBenefits2,AssessBenefits3,AssessBenefits4,AssessBenefits5,AssessBenefits6,AssessBenefits7,AssessBenefits8,AssessBenefits9,AssessBenefits10,AssessBenefits11,JobContactPriorities1,JobContactPriorities2,JobContactPriorities3,JobContactPriorities4,JobContactPriorities5,JobEmailPriorities1,JobEmailPriorities2,JobEmailPriorities3,JobEmailPriorities4,JobEmailPriorities5,JobEmailPriorities6,JobEmailPriorities7,UpdateCV,Currency,Salary,SalaryType,ConvertedSalary,CurrencySymbol,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AdsPriorities1,AdsPriorities2,AdsPriorities3,AdsPriorities4,AdsPriorities5,AdsPriorities6,AdsPriorities7,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
9159,Yes,Yes,Russian Federation,"Yes, part-time",Employed full-time,Some college/university study without earning ...,"A natural science (ex. biology, chemistry, phy...","1,000 to 4,999 employees",Mobile developer,9-11 years,6-8 years,Moderately satisfied,Moderately satisfied,Working as a founder or co-founder of my own c...,"I’m not actively looking, but I am open to new...",More than 4 years ago,5.0,10.0,6.0,2.0,3.0,8.0,7.0,1.0,9.0,4.0,1.0,4.0,2.0,9.0,6.0,11.0,5.0,8.0,10.0,7.0,3.0,1.0,2.0,4.0,5.0,3.0,5.0,4.0,7.0,2.0,1.0,6.0,3.0,I had a negative experience or interaction at ...,Russian rubles (₽),94000,Monthly,19956.0,RUB,"Confluence;Jira;Other chat system (IRC, propri...",One to three months,"Taught yourself a new language, framework, or ...",The official documentation and/or standards fo...,,,Neither Agree nor Disagree,Strongly disagree,Disagree,C++;Swift,Swift,,,iOS,Apple Watch or Apple TV;iOS;Mac OS,,,Notepad++;Xcode,MacOS,2,Agile;Scrum,Git,Once a day,No,,,Somewhat agree,Somewhat agree,Neither agree nor disagree,Clicked on an online advertisement;Stopped goi...,1.0,3.0,4.0,5.0,6.0,2.0,7.0,,Increasing automation of jobs,A governmental or other regulatory body,I'm excited about the possibilities more than ...,Depends on what it is,Depends on what it is,The person who came up with the idea,Unsure / I don't know,10 (Very Likely),Daily or almost daily,Yes,Less than once per month or monthly,Yes,"No, I have one but it's out of date",7,Yes,Very interested,Somewhat interested,Very interested,Somewhat interested,Somewhat interested,I do not have a set schedule,9 - 12 hours,Less than 30 minutes,1 - 2 times per week,,I don't typically exercise,Male,Straight or heterosexual,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",White or of European descent,25 - 34 years old,No,,The survey was too long,Somewhat easy
77101,Yes,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",,Back-end developer;Desktop or enterprise appli...,15-17 years,9-11 years,Moderately satisfied,Extremely satisfied,Working in a different or more specialized tec...,I am not interested in new job opportunities,Less than a year ago,10.0,8.0,5.0,2.0,1.0,9.0,4.0,3.0,6.0,7.0,1.0,4.0,3.0,10.0,9.0,2.0,5.0,6.0,11.0,8.0,7.0,,,,,,,,,,,,,My job status or other personal status changed,U.S. dollars ($),160000,Yearly,160000.0,USD,"Office / productivity suite (Microsoft Office,...",One to three months,Completed an industry certification program (e...,The official documentation and/or standards fo...,,To improve my general technical skills or prog...,Agree,Neither Agree nor Disagree,Strongly disagree,C#;Go;JavaScript;TypeScript;HTML;CSS,C#;Go;JavaScript;TypeScript;HTML;CSS,MongoDB;Redis;SQL Server,MongoDB;Redis;SQL Server;Neo4j,AWS;Heroku,Arduino;AWS;Azure;Serverless,Angular;Node.js;React,.NET Core;Node.js;React;TensorFlow,Notepad++;Visual Studio;Visual Studio Code,Windows,2,Agile;Formal standard such as ISO 9001 or IEEE...,Git;Team Foundation Version Control,Multiple times per day,Yes,Yes,The website I was visiting forced me to disabl...,Somewhat agree,Somewhat disagree,Somewhat disagree,Clicked on an online advertisement;Stopped goi...,1.0,4.0,2.0,6.0,3.0,7.0,5.0,Algorithms making important decisions,Increasing automation of jobs,The developers or the people creating the AI,I'm excited about the possibilities more than ...,Yes,Depends on what it is,Upper management at the company/organization,Yes,10 (Very Likely),A few times per week,Yes,A few times per week,Yes,Yes,10 (Very Likely),Yes,A little bit interested,Not at all interested,Somewhat interested,Extremely interested,A little bit interested,Between 8:01 - 9:00 AM,Over 12 hours,Less than 30 minutes,Never,,1 - 2 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, 
B.Eng., etc.)",Black or of African descent,25 - 34 years old,No,No,The survey was an appropriate length,Very easy


In [7]:
# Random sample for 2017
survey_2017.sample(2)

Unnamed: 0,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,CompanyType,YearsProgram,YearsCodedJob,YearsCodedJobPast,DeveloperType,WebDeveloperType,MobileDeveloperType,NonDeveloperType,CareerSatisfaction,JobSatisfaction,ExCoderReturn,ExCoderNotForMe,ExCoderBalance,ExCoder10Years,ExCoderBelonged,ExCoderSkills,ExCoderWillNotCode,ExCoderActive,PronounceGIF,ProblemSolving,BuildingThings,LearningNewTech,BoringDetails,JobSecurity,DiversityImportant,AnnoyingUI,FriendsDevelopers,RightWrongWay,UnderstandComputers,SeriousWork,InvestTimeTools,WorkPayCare,KinshipDevelopers,ChallengeMyself,CompetePeers,ChangeWorld,JobSeekingStatus,HoursPerWeek,LastNewJob,AssessJobIndustry,AssessJobRole,AssessJobExp,AssessJobDept,AssessJobTech,AssessJobProjects,AssessJobCompensation,AssessJobOffice,AssessJobCommute,AssessJobRemote,AssessJobLeaders,AssessJobProfDevel,AssessJobDiversity,AssessJobProduct,AssessJobFinances,ImportantBenefits,ClickyKeys,JobProfile,ResumePrompted,LearnedHiring,ImportantHiringAlgorithms,ImportantHiringTechExp,ImportantHiringCommunication,ImportantHiringOpenSource,ImportantHiringPMExp,ImportantHiringCompanies,ImportantHiringTitles,ImportantHiringEducation,ImportantHiringRep,ImportantHiringGettingThingsDone,Currency,Overpaid,TabsSpaces,EducationImportant,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,CousinEducation,WorkStart,HaveWorkedLanguage,WantWorkLanguage,HaveWorkedFramework,WantWorkFramework,HaveWorkedDatabase,WantWorkDatabase,HaveWorkedPlatform,WantWorkPlatform,IDE,AuditoryEnvironment,Methodology,VersionControl,CheckInCode,ShipIt,OtherPeoplesCode,ProjectManagement,EnjoyDebugging,InTheZone,DifficultCommunication,CollaborateRemote,MetricAssess,EquipmentSatisfiedMonitors,EquipmentSatisfiedCPU,EquipmentSatisfiedRAM,EquipmentSatisfiedStorage,EquipmentSatisfiedRW,InfluenceInternet,InfluenceWorkstation,InfluenceHardware,InfluenceServers,InfluenceTechStack,InfluenceDeptTech,InfluenceVizTools,InfluenceData
base,InfluenceCloud,InfluenceConsultants,InfluenceRecruitment,InfluenceCommunication,StackOverflowDescribes,StackOverflowSatisfaction,StackOverflowDevices,StackOverflowFoundAnswer,StackOverflowCopiedCode,StackOverflowJobListing,StackOverflowCompanyPage,StackOverflowJobSearch,StackOverflowNewQuestion,StackOverflowAnswer,StackOverflowMetaChat,StackOverflowAdsRelevant,StackOverflowAdsDistracting,StackOverflowModeration,StackOverflowCommunity,StackOverflowHelpful,StackOverflowBetter,StackOverflowWhatDo,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
26580,Professional developer,No,United Kingdom,"Yes, full-time",Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...",100 to 499 employees,"Privately-held limited company, not in startup...",20 or more years,20 or more years,,Other,,,,8.0,8.0,,,,,,,,,"With a soft ""g,"" like ""jiff""",Strongly agree,Agree,Agree,Agree,Strongly agree,Agree,Strongly agree,Strongly disagree,Strongly agree,Disagree,Strongly agree,Somewhat agree,Somewhat agree,Somewhat agree,Agree,Disagree,Strongly disagree,I am not interested in new job opportunities,,More than 4 years ago,,,,,,,,,,,,,,,,Annual bonus; Vacation/days off; Health benefi...,No,,,,,,,,,,,,,,British pounds sterling (£),,Tabs,Very important,Online course; On-the-job training; Self-taugh...,Trade book,,None of these,,C#; SQL,,.NET Core,,SQL Server,,,,Visual Studio,Keep the room absolutely quiet,Waterfall; Agile; PRINCE2; Scrum; Kanban,Git,Multiple times a day,Agree,Disagree,Somewhat agree,Agree,Strongly agree,Somewhat agree,Agree,Bugs found,,,,,,,,,,,,,,,,,,"I've visited Stack Overflow, but haven't logge...",4.0,Desktop,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Strongly disagree,Strongly agree,Strongly disagree,Strongly disagree,Strongly disagree,Strongly disagree,Strongly disagree,Strongly disagree,Male,A bachelor's degree,White or of European descent,Strongly agree,Disagree,Disagree,Strongly disagree,,
7266,Professional developer,No,Israel,No,Employed full-time,Bachelor's degree,Psychology,Never,20 to 99 employees,Venture-funded startup,3 to 4 years,2 to 3 years,,Mobile developer,,iOS,,9.0,8.0,,,,,,,,,"With a hard ""g,"" like ""gift""",Agree,Strongly agree,Strongly agree,Somewhat agree,Strongly agree,Strongly agree,Agree,Agree,Agree,Somewhat agree,Strongly agree,Agree,Disagree,Agree,Agree,Agree,Somewhat agree,,,,,,,,,,,,,,,,,,,,Yes,,,,,,,,,,,,,,U.S. dollars ($),Somewhat underpaid,Tabs,Not very important,Part-time/evening course; Self-taught; Hackathon,Official documentation; Trade book; Stack Over...,,Bootcamp; Part-time/evening courses; Participa...,10:00 AM,JavaScript; Objective-C; Swift,,React; Xamarin,AngularJS; React,MongoDB,,iOS,Android; iOS,Sublime Text; Android Studio; Visual Studio; X...,Keep the room absolutely quiet,Waterfall; Agile; Scrum,Git,Multiple times a day,Agree,Disagree,Disagree,Agree,Agree,Disagree,Agree,Bugs found; Hours worked; On time/in budget; M...,Satisfied,Somewhat satisfied,Somewhat satisfied,Somewhat satisfied,Satisfied,Not very satisfied,I am the final decision maker,No influence at all,Not much influence,Some influence,A lot of influence,No influence at all,No influence at all,Some influence,No influence at all,I am the final decision maker,Not much influence,"I have a login for Stack Overflow, but haven't...",7.0,Desktop; Android browser,Several times,Several times,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Agree,Disagree,Strongly disagree,Agree,Agree,Agree,Somewhat agree,Disagree,Male,A professional degree,White or of European descent,Disagree,Somewhat agree,Strongly disagree,Agree,60000.0,


In [8]:
# Random sample for 2016
survey_2016.sample(2)

Unnamed: 0,collector,country,un_subregion,so_region,age_range,age_midpoint,gender,self_identification,occupation,occupation_group,experience_range,experience_midpoint,salary_range,salary_midpoint,big_mac_index,tech_do,tech_want,aliens,programming_ability,employment_status,industry,company_size_range,team_size_range,women_on_team,remote,job_satisfaction,job_discovery,dev_environment,commit_frequency,hobby,dogs_vs_cats,desktop_os,unit_testing,rep_range,visit_frequency,why_learn_new_tech,education,open_to_new_job,new_job_value,job_search_annoyance,interview_likelihood,how_to_improve_interview_process,star_wars_vs_star_trek,agree_tech,agree_notice,agree_problemsolving,agree_diversity,agree_adblocker,agree_alcohol,agree_loveboss,agree_nightcode,agree_legacy,agree_mars,important_variety,important_control,important_sameend,important_newtech,important_buildnew,important_buildexisting,important_promotion,important_companymission,important_wfh,important_ownoffice,developer_challenges,why_stack_overflow
44054,Meta Stack Overflow Post,United States,North America,North America,30-34,32.0,Male,Programmer,Back-end web developer,Back-end web developer,2 - 5 years,3.5,"$70,000 - $80,000",75000.0,4.93,Java,Hadoop; Java; Spark,No,8.0,Employed full-time,Other (please specify),"10,000+ employees",5-9 people,1,Never,I'm somewhat satisfied with my job,Career fair,Eclipse,Multiple times a day,5-10 hours per week,Cats,Windows 7,Yes,"1,001 - 5,000",Multiple times a day,I want to be a better developer,Masters Degree in Computer Science (or related...,"I'm not actively looking, but I am open to new...",Work/life balance; Office location; Quality of...,Searching for a job that seems interesting,60%,Show me more live code; Fewer brainteasers,Star Trek,Agree completely,Agree somewhat,Neutral,Agree completely,Agree somewhat,Disagree completely,Agree somewhat,Neutral,Disagree completely,Disagree completely,This is very important,This is very important,I don't care about this,This is very important,I don't care about this,This is somewhat important,This is somewhat important,I don't care about this,I don't care about this,I don't care about this,Changing requirements; Unspecific requirements,To get help for my job; To give help to others...
26212,Meta Stack Overflow Post,India,Southern Asia,South Asia,25-29,27.0,Male,Developer; Programmer; Sr. Developer; Full-sta...,Back-end web developer,Back-end web developer,2 - 5 years,3.5,"Less than $10,000",5000.0,1.9,AngularJS; JavaScript; LAMP; PHP; SQL,AngularJS; JavaScript; LAMP; PHP; SQL,Yes,7.0,Employed full-time,Healthcare,20-99 employees,5-9 people,0,I rarely work remotely,I hate my job,Contacted by external recruiter,Notepad++; Sublime; PhpStorm,Multiple times a day,1-2 hours per week,Dogs,Ubuntu,Yes,"1,001 - 5,000",Multiple times a day,I want to be a better developer,I'm self-taught; On-the-job training,I am actively looking for a new job,Salary; Tech stack; Work/life balance; Office ...,Interesting companies rarely respond to me,Other (please specify),Be more flexible about interview scheduling; F...,Star Wars,Agree completely,Agree somewhat,Agree somewhat,Agree somewhat,Neutral,Neutral,Neutral,Agree somewhat,Neutral,Disagree somewhat,This is somewhat important,This is somewhat important,This is very important,This is very important,This is very important,This is very important,This is somewhat important,This is somewhat important,This is very important,This is very important,Poor scheduling; Unrealistic expectations; Int...,To give help to others; To discover new job op...


Now we have a better sense of what the data looks like, so we can proceed to picking the columns that we will need for the analysis, given the questions we outlined in [Section 1](#business). They are listed below for each survey year.

The columns that indicate which languages respondents have worked with are the following:
> Survey 2020: LanguageWorkedWith \
> Survey 2019: LanguageWorkedWith \
> Survey 2018: LanguageWorkedWith \
> Survey 2017: HaveWorkedLanguage \
> Survey 2016: tech_do

The columns that indicate how satisfied a respondent is with their job are:
> Survey 2020: JobSat \
> Survey 2019: JobSat \
> Survey 2018: JobSatisfaction \
> Survey 2017: JobSatisfaction \
> Survey 2016: job_satisfaction

The columns that indicate education status are the following:
> Survey 2020: EdLevel \
> Survey 2019: EdLevel \
> Survey 2018: FormalEducation \
> Survey 2017: FormalEducation \
> Survey 2016: education

The columns that indicate where the respondent lives are:
> Survey 2020: Country \
> Survey 2019: Country \
> Survey 2018: Country \
> Survey 2017: Country \
> Survey 2016: country

The columns that indicate the respondent's gender are:
> Survey 2020: Gender \
> Survey 2019: Gender \
> Survey 2018: Gender \
> Survey 2017: Gender \
> Survey 2016: gender

The columns that indicate the respondent's employment status are:
> Survey 2020: Employment \
> Survey 2019: Employment \
> Survey 2018: Employment \
> Survey 2017: EmploymentStatus \
> Survey 2016: employment_status
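
The samples above show that the language columns hold several answers in a single string, separated by `;` in recent years (for example `JavaScript;Python;SQL;TypeScript`) and by `; ` in 2017 (for example `C#; SQL`). A minimal sketch of how such strings could be split and counted is shown below; the helper names `split_languages` and `count_languages` are hypothetical and not part of the notebook.

```python
from collections import Counter


def split_languages(answer):
    """Split a multi-select answer such as 'C#; SQL' into ['C#', 'SQL']."""
    if not isinstance(answer, str):  # missing answers are NaN (a float) in pandas
        return []
    return [part.strip() for part in answer.split(";") if part.strip()]


def count_languages(answers):
    """Count how often each language appears across all answers."""
    counts = Counter()
    for answer in answers:
        counts.update(split_languages(answer))
    return counts


# Values copied from the 2020 and 2017 samples shown earlier, plus one missing answer
sample = ["JavaScript;Python;SQL;TypeScript", "C#; SQL", float("nan")]
print(count_languages(sample).most_common(1))  # [('SQL', 2)]
```

Stripping whitespace after splitting lets the same function handle both separator styles.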


Lastly, we can take a look at the shapefile we imported:

In [9]:
# Head of shapefiles
map_df.head()

Unnamed: 0,OBJECTID,CNTRY_NAME,CNTRY_CODE,BPL_CODE,geometry
0,1,Algeria,12,13010.0,"MULTIPOLYGON (((-2.05592 35.07370, -2.05675 35..."
1,2,Angola,24,12010.0,"MULTIPOLYGON (((12.79760 -4.41685, 12.79875 -4..."
2,3,In dispute South Sudan/Sudan,9999,99999.0,"POLYGON ((28.08408 9.34722, 28.03889 9.34722, ..."
3,4,Benin,204,15010.0,"MULTIPOLYGON (((1.93753 6.30122, 1.93422 6.299..."
4,5,Botswana,72,14010.0,"POLYGON ((25.16312 -17.77816, 25.16383 -17.778..."


So the country names are in the column `CNTRY_NAME`. We will need to match these with our survey data, and for that we will have to harmonize all the country names in [Section 3](#prepare).
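
Before harmonizing, it can help to list which survey country names have no exact counterpart in the shapefile. The sketch below is one possible check under stated assumptions: the function name `unmatched_countries` is hypothetical, matching is case-insensitive exact matching only, and the example values are illustrative rather than taken from the real files.

```python
def unmatched_countries(survey_names, shapefile_names):
    """Return survey country names with no exact (case-insensitive) match."""
    known = {
        name.strip().lower() for name in shapefile_names if isinstance(name, str)
    }
    return sorted(
        {
            name
            for name in survey_names
            if isinstance(name, str) and name.strip().lower() not in known
        }
    )


# Illustrative values only; in the notebook these could come from
# survey_2020['Country'].unique() and map_df['CNTRY_NAME'].unique().
survey_names = ["Algeria", "Russian Federation", "Benin"]
shapefile_names = ["Algeria", "Angola", "Benin", "Botswana"]
print(unmatched_countries(survey_names, shapefile_names))  # ['Russian Federation']
```

The names returned by such a check are the ones that would need a manual or library-assisted mapping.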

Given all of the above we can proceed to prepare our data!

<a name="prepare"></a>
## 3. Prepare data

Since we already know which columns we need to answer our questions, we can start by dropping the columns that are not relevant to us:

In [11]:
# Put relevant variables in list
keep_2020 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2019 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2018 = ['LanguageWorkedWith', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'Employment']
keep_2017 = ['HaveWorkedLanguage', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'EmploymentStatus']
keep_2016 = ['tech_do', 'job_satisfaction', 'education', 'country', 'gender', 'employment_status']

# Keep only relevant variables
survey_2020 = survey_2020[keep_2020]
survey_2019 = survey_2019[keep_2019]
survey_2018 = survey_2018[keep_2018]
survey_2017 = survey_2017[keep_2017]
survey_2016 = survey_2016[keep_2016]

Nice! Now we can rename the columns so that all dataframes use the same variable names.

In [12]:
# Rename columns
survey_2020.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2019.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2018.rename(columns={'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2017.rename(columns={'HaveWorkedLanguage': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'EmploymentStatus': 'employment'}, inplace = True)
survey_2016.rename(columns={'tech_do': 'languages', 'job_satisfaction': 'job_satisfaction', 
                           'education': 'education', 'country': 'country', 'gender': 'gender', 
                           'employment_status': 'employment'}, inplace = True);
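
Selecting and renaming can also be driven by a single per-year mapping, which keeps the column lists from Section 2.2 in one place. This is a hedged sketch and not the notebook's method: `COLUMN_MAP` and `harmonize_columns` are hypothetical names, and the added `year` column is an assumption made here so that rows remain identifiable once the surveys are merged.

```python
import pandas as pd

# Source column -> harmonized name, per survey year (from the lists in Section 2.2)
COLUMN_MAP = {
    2020: {'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction',
           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender',
           'Employment': 'employment'},
    2019: {'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction',
           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender',
           'Employment': 'employment'},
    2018: {'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction',
           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender',
           'Employment': 'employment'},
    2017: {'HaveWorkedLanguage': 'languages', 'JobSatisfaction': 'job_satisfaction',
           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender',
           'EmploymentStatus': 'employment'},
    2016: {'tech_do': 'languages', 'job_satisfaction': 'job_satisfaction',
           'education': 'education', 'country': 'country', 'gender': 'gender',
           'employment_status': 'employment'},
}


def harmonize_columns(df, year):
    """Keep only the mapped columns for `year` and give them common names."""
    mapping = COLUMN_MAP[year]
    missing = [col for col in mapping if col not in df.columns]
    if missing:
        raise KeyError(f"{year} survey is missing columns: {missing}")
    # rename returns a new DataFrame, so the input is left untouched
    return df[list(mapping)].rename(columns=mapping).assign(year=year)
```

Raising an error on missing columns surfaces a wrong source name immediately, whereas `rename` silently ignores keys that do not exist in the dataframe.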

Now we need to harmonize the answers to the different questions across all survey years so that we can merge them into one complete data set. Let's start with an easy one and look at the categories for gender in each year.

In [44]:
# Initialize pandarallel
pandarallel.initialize()

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [13]:
# Print unique gender categories in 2020
survey_2020['gender'].unique()

array(['Man', nan, 'Woman',
       'Man;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man'], dtype=object)

In [14]:
# Print unique gender categories in 2019
survey_2019['gender'].unique()

array(['Man', nan, 'Woman',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man',
       'Man;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [15]:
# Print unique gender categories in 2018
survey_2018['gender'].unique()

array(['Male', nan, 'Female',
       'Female;Male;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Male',
       'Male;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming', 'Transgender',
       'Female;Transgender',
       'Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Non-binary, genderqueer, or gender non-conforming',
       'Female;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender', 'Female;Male;Transgender',
       'Female;Male;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [16]:
# Print unique gender categories in 2017
survey_2017['gender'].unique()

array(['Male', nan, 'Female', 'Gender non-conforming', 'Other',
       'Male; Gender non-conforming', 'Female; Transgender',
       'Male; Female', 'Male; Other', 'Transgender',
       'Transgender; Gender non-conforming',
       'Female; Gender non-conforming',
       'Male; Female; Transgender; Gender non-conforming; Other',
       'Male; Female; Transgender', 'Male; Female; Other',
       'Male; Female; Transgender; Gender non-conforming',
       'Male; Transgender', 'Female; Transgender; Gender non-conforming',
       'Gender non-conforming; Other',
       'Male; Female; Gender non-conforming', 'Female; Other',
       'Male; Transgender; Gender non-conforming', 'Transgender; Other',
       'Male; Gender non-conforming; Other',
       'Female; Gender non-conforming; Other',
       'Male; Female; Gender non-conforming; Other',
       'Female; Transgender; Other',
       'Female; Transgender; Gender non-conforming; Other',
       'Male; Transgender; Other', 'Male; Female; Transgender;

In [17]:
# Print unique gender categories in 2016
survey_2016['gender'].unique()

array(['Male', nan, 'Female', 'Prefer not to disclose', 'Other'],
      dtype=object)

Given what we see above, let's group all answers into the following four categories: Male, Female, Other and nan. We can define a function to assign the value Male, Female, Other or nan.
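
As a side note, the same four-way mapping can also be sketched without a row loop. The snippet below is a minimal illustration on a toy Series, not the implementation used in this notebook; the helper name `harmonize_gender_vectorized` is made up for the example.

```python
import numpy as np
import pandas as pd

def harmonize_gender_vectorized(gender):
    """Map raw gender answers to Male, Female, Other or NaN (hypothetical helper)."""
    # Normalize case and surrounding whitespace; missing answers stay missing
    cleaned = gender.str.strip().str.lower()
    # Start with 'Other' everywhere, then overwrite the known cases
    result = pd.Series('Other', index=gender.index, dtype=object)
    result[cleaned.isin(['male', 'man'])] = 'Male'
    result[cleaned.isin(['female', 'woman'])] = 'Female'
    result[gender.isna()] = np.nan
    return result

# Toy answers in the style of the outputs printed above
raw = pd.Series(['Man', 'Female', np.nan, 'Woman;Man', 'Prefer not to disclose'])
harmonized = harmonize_gender_vectorized(raw)
print(harmonized.tolist())
```

Multi-select answers such as `Woman;Man` fall into Other here, which mirrors the `else` branch of the loop-based function.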

In [None]:
# Progress bar used in the row loops below
from tqdm import tqdm

# Define function to harmonize gender
def harmonize_gender(df_raw):
    '''This function unifies all gender categories into 
    four: Male, Female, Other and nan. It also creates 
    binary variables for each of the above categories.
    '''
    # Copy df_raw
    df = df_raw.copy()
    # Binary variable for categories
    df['gender_male'] = 0
    df['gender_female'] = 0
    df['gender_other'] = 0
    df['gender_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define gender
        gender = str(df.loc[i, 'gender']).lower()
        # Value if male or man
        if gender == 'male' or gender == 'man':
            df.loc[i, 'gender'] = 'Male'
            df.loc[i, 'gender_male'] = 1
        # Value if female or woman
        elif gender == 'female' or gender == 'woman':
            df.loc[i, 'gender'] = 'Female'
            df.loc[i, 'gender_female'] = 1
        # Assign null values
        elif gender == 'nan':
            df.loc[i, 'gender'] = np.nan
            df.loc[i, 'gender_null'] = 1
        # Other categories lumped into other
        else:
            df.loc[i, 'gender'] = 'Other'
            df.loc[i, 'gender_other'] = 1
    # Return harmonized dataframe
    return(df)

# Apply gender harmonizer
survey_2020 = harmonize_gender(survey_2020)
survey_2019 = harmonize_gender(survey_2019)
survey_2018 = harmonize_gender(survey_2018)
survey_2017 = harmonize_gender(survey_2017)
survey_2016 = harmonize_gender(survey_2016)

We can take a quick look at what the data looks like now:

In [None]:
survey_2020.head()

Similarly, for job satisfaction we can look at the possible values:

In [None]:
# Print unique job satisfaction categories in 2020
survey_2020['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2019
survey_2019['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2018
survey_2018['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2017
survey_2017['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2016
survey_2016['job_satisfaction'].unique()

We will lump all answers into six categories: Very satisfied, Satisfied, Neither, Dissatisfied, Very Dissatisfied and nan. We will take a similar approach to gender and define a function to do this.
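
One way to picture this lumping is as a flat lookup table from every answer variant to its harmonized category. The sketch below uses only the standard library; the names `CATEGORY_VARIANTS`, `LOOKUP` and `harmonize_answer` are illustrative and not part of the notebook's own function.

```python
# Answer variants per harmonized category (same lists as in the function in this section)
CATEGORY_VARIANTS = {
    'Very satisfied': ['very satisfied', 'extremely satisfied', 'i love my job', '10.0', '9.0'],
    'Satisfied': ['slightly satisfied', 'moderately satisfied',
                  "i'm somewhat satisfied with my job", '8.0', '7.0'],
    'Neither': ['neither satisfied nor dissatisfied',
                "i'm neither satisfied nor dissatisfied", '6.0', '5.0', '4.0'],
    'Dissatisfied': ['slightly dissatisfied', 'moderately dissatisfied',
                     "i'm somewhat dissatisfied with my job", '3.0', '2.0'],
    'Very Dissatisfied': ['very dissatisfied', 'extremely dissatisfied',
                          'i hate my job', '1.0', '0.0'],
}

# Invert the table: one dictionary entry per answer variant
LOOKUP = {variant: category
          for category, variants in CATEGORY_VARIANTS.items()
          for variant in variants}

def harmonize_answer(value):
    """Return the harmonized category, or None for missing or unknown answers."""
    return LOOKUP.get(str(value).strip().lower())

print(harmonize_answer('I love my job'))
print(harmonize_answer(9.0))
```

A dictionary like this could also be passed to `Series.map` after lowercasing the column, which would avoid looping over rows.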

In [None]:
trial_list = survey_2017['job_satisfaction'].unique().tolist()
for i in trial_list:
    print(str(i) == '9.0')

In [None]:
# Define function to harmonize job satisfaction
def harmonize_jobsatisfaction(df_raw):
    '''This function harmonizes all the job
    satisfaction responses into: Very satisfied,
    Satisfied, Neither, Dissatisfied, Very dissatisfied
    and nan. It also creates binary variables for each
    of the above categories.'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    v_satisfied = ['very satisfied', 'extremely satisfied', 'i love my job', '10.0', '9.0']
    satisfied = ['slightly satisfied', 'moderately satisfied', 'i\'m somewhat satisfied with my job', '8.0', '7.0']
    neither = ['neither satisfied nor dissatisfied', 'i\'m neither satisfied nor dissatisfied', '6.0', '5.0', '4.0']
    dissatisfied = ['slightly dissatisfied', 'moderately dissatisfied', 'i\'m somewhat dissatisfied with my job', '3.0', '2.0']
    v_dissatisfied = ['very dissatisfied', 'extremely dissatisfied', 'i hate my job', '1.0', '0.0']
    # New binary variables
    df['jobsat_v_satisfied'] = 0
    df['jobsat_satisfied'] = 0
    df['jobsat_neither'] = 0
    df['jobsat_dissatisfied'] = 0
    df['jobsat_v_dissatisfied'] = 0
    df['jobsat_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define job satisfaction
        job_satisfac = str(df.loc[i, 'job_satisfaction']).lower()
        # Value if very satisfied and assign binary variable
        if job_satisfac in v_satisfied:
            df.loc[i, 'job_satisfaction'] = 'Very satisfied'
            df.loc[i,'jobsat_v_satisfied'] = 1
        # Value if satisfied and assign binary variable
        elif job_satisfac in satisfied:
            df.loc[i, 'job_satisfaction'] = 'Satisfied'
            df.loc[i, 'jobsat_satisfied'] = 1
        # Value if neither and assign binary variable
        elif job_satisfac in neither:
            df.loc[i, 'job_satisfaction'] = 'Neither'
            df.loc[i, 'jobsat_neither'] = 1
        # Value if dissatisfied and assign binary variable
        elif job_satisfac in dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Dissatisfied'
            df.loc[i, 'jobsat_dissatisfied'] = 1
        # Value if very dissatisfied and assign binary variable
        elif job_satisfac in v_dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Very Dissatisfied'
            df.loc[i, 'jobsat_v_dissatisfied'] = 1
        # Other categories become np.nan values
        else:
            df.loc[i, 'job_satisfaction'] = np.nan
            df.loc[i, 'jobsat_null'] = 1
    # Return harmonized dataframe
    return(df)
    
    
# Apply job satisfaction harmonizer
survey_2020 = harmonize_jobsatisfaction(survey_2020)
survey_2019 = harmonize_jobsatisfaction(survey_2019)
survey_2018 = harmonize_jobsatisfaction(survey_2018)
survey_2017 = harmonize_jobsatisfaction(survey_2017)
survey_2016 = harmonize_jobsatisfaction(survey_2016)

Again, we can check that this works by looking at a few rows:

In [None]:
survey_2017['job_satisfaction'].head()

Next, let's look at the employment variables and how they are laid out.

In [None]:
# Print unique employment categories in 2020
survey_2020['employment'].unique()

In [None]:
# Print unique employment categories in 2019
survey_2019['employment'].unique()

In [None]:
# Print unique employment categories in 2018
survey_2018['employment'].unique()

In [None]:
# Print unique employment categories in 2017
survey_2017['employment'].unique()

In [None]:
# Print unique employment categories in 2016
survey_2016['employment'].unique()

Now we have to deal with the responses for employment status. This one is a bit trickier, as the answer categories have changed over the years. With that in mind, let's create a function to harmonize these categories into the following: Full-time, Part-time, Self-employed, Not employed, Other and nan.
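
The binary variables that accompany each harmonized category can also be derived in one step once the column itself is harmonized. The following is a small sketch with made-up data; note that `pd.get_dummies` names the columns after the category labels, so they differ from the `employment_full_time` style used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column that is already harmonized into the categories listed above
employment = pd.Series(['Full-time', 'Part-time', np.nan, 'Full-time', 'Other'],
                       name='employment')

# One indicator column per category, plus one for missing answers
indicators = pd.get_dummies(employment, prefix='employment', dummy_na=True, dtype=int)
print(indicators)
```

Each row has exactly one indicator set to 1, which is the property the later `groupby(...).mean()` calls rely on to produce shares.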

In [None]:
# Define function to harmonize employment categories
def harmonize_employment(df_raw):
    '''This function harmonizes all employment responses
    into: Full-time, Part-time, Self-employed, Not employed,
    Other and nan. It also creates binary variables for each
    of the above categories.'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    full_time = ['employed full-time']
    part_time = ['employed part-time']
    self_empl = ['independent contractor, freelancer, or self-employed', 'freelance / contractor', 'self-employed']
    not_employed = ['not employed, but looking for work', 'not employed, and not looking for work', 'unemployed']
    other = ['student', 'i\'m a student', 'retired', 'i prefer not to say', 'prefer not to disclose', 'other (please specify)']
    # New binary variables
    df['employment_full_time'] = 0
    df['employment_part_time'] = 0
    df['employment_self_empl'] = 0
    df['employment_not_empl'] = 0
    df['employment_other'] = 0
    df['employment_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define employment
        employment = str(df.loc[i, 'employment']).lower()
        # Value if full-time and assign binary variable
        if employment in full_time:
            df.loc[i, 'employment'] = 'Full-time'
            df.loc[i, 'employment_full_time'] = 1
        # Value if part-time and assign binary variable
        elif employment in part_time:
            df.loc[i, 'employment'] = 'Part-time'
            df.loc[i, 'employment_part_time'] = 1
        # Value if self-employed and assign binary variable
        elif employment in self_empl:
            df.loc[i, 'employment'] = 'Self-employed'
            df.loc[i, 'employment_self_empl'] = 1
        # Value if not employed and assign binary variable
        elif employment in not_employed:
            df.loc[i, 'employment'] = 'Not employed'
            df.loc[i, 'employment_not_empl'] = 1
        # Value if other and assign binary variable
        elif employment in other:
            df.loc[i, 'employment'] = 'Other'
            df.loc[i, 'employment_other'] = 1
        # Other categories become np.nan values
        else:
            df.loc[i, 'employment'] = np.nan
            df.loc[i, 'employment_null'] = 1
    # Return harmonized dataframe
    return(df)

# Apply employment harmonizer
survey_2020 = harmonize_employment(survey_2020)
survey_2019 = harmonize_employment(survey_2019)
survey_2018 = harmonize_employment(survey_2018)
survey_2017 = harmonize_employment(survey_2017)
survey_2016 = harmonize_employment(survey_2016)

Now let's look at education variables and their respective values.

In [None]:
# Print unique education categories in 2020
survey_2020['education'].unique()

In [None]:
# Print unique education categories in 2019
survey_2019['education'].unique()

In [None]:
# Print unique education categories in 2018
survey_2018['education'].unique()

In [None]:
# Print unique education categories in 2017
survey_2017['education'].unique()

In [None]:
# Print unique education categories in 2016
survey_2016['education'].unique().tolist()

The answers for 2016 look very different from those of the other years. This is probably because respondents were allowed to tick more than one box. We can start to untangle this by putting all possible options in a list called education_options.
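
For reference, the same idea (split on `;`, trim whitespace, keep unique values) can be sketched with pandas string methods. The option labels below are placeholders, not the real 2016 answers.

```python
import numpy as np
import pandas as pd

# Placeholder multi-select answers separated by ';'
answers = pd.Series(['Option A; Option B', 'Option B', np.nan, 'Option C;Option A'])

options = (
    answers.dropna()          # skip respondents who did not answer
           .str.split(';')    # one list of options per respondent
           .explode()         # one row per ticked option
           .str.strip()       # remove surrounding white space
           .drop_duplicates()
           .tolist()
)
print(options)  # ['Option A', 'Option B', 'Option C']
```

Unlike a loop over `str(i)`, this version drops missing answers instead of recording them as the string `'nan'`.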

In [None]:
# Put education categories into list
education_2016 = survey_2016['education'].unique().tolist()

# Create empty list for possible education options
education_options = []
# Loop over answers and append only unique values
for i in education_2016:
    for opt in str(i).split(';'): # Since options are separated by ;
        # Remove leading white space and append only unique values
        opt = opt.lstrip()
        if opt not in education_options:
            education_options.append(opt)

This gives us the following available options for respondents:

In [None]:
education_options

Now, we want to categorize people into the following categories: Primary education, Secondary education, Some college, Bachelor's, Professional degree, Master's, Doctorate.
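
Since a respondent may have ticked several boxes, one possible rule is to keep the highest level among the ticked options. The sketch below assumes that rule and uses a hypothetical option-to-category table; the real keys would have to come from `education_options`, and the ordering simply follows the list above.

```python
# Categories ordered from lowest to highest, following the list above
LEVELS = ['Primary education', 'Secondary education', 'Some college',
          "Bachelor's", 'Professional degree', "Master's", 'Doctorate']

# HYPOTHETICAL raw options; replace with the actual values in education_options
OPTION_TO_LEVEL = {
    'some college coursework': 'Some college',
    'b.s. in computer science': "Bachelor's",
    'masters degree': "Master's",
    'phd': 'Doctorate',
}

def highest_education(answer):
    """Return the highest category among ';'-separated options, or None."""
    levels = [OPTION_TO_LEVEL.get(opt.strip().lower())
              for opt in str(answer).split(';')]
    levels = [lvl for lvl in levels if lvl is not None]
    if not levels:
        return None
    # Rank by position in LEVELS and keep the highest
    return max(levels, key=LEVELS.index)

print(highest_education('Some college coursework; PhD'))
```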

Lastly, to match the survey data with the geodata we imported, we need to harmonize the country names. To do that, we can use the [country_converter library](https://pypi.org/project/country-converter/). We define the following function and apply it to the country values:
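
Because a survey contains far fewer distinct countries than rows, it may be cheaper to convert each distinct name once and then broadcast the result. The sketch below illustrates that idea with a toy converter; `add_iso3` and `toy_codes` are hypothetical names, and the converter argument stands in for whatever lookup is actually used.

```python
import pandas as pd

def add_iso3(df, name_col, convert):
    """Return a copy of df with an 'iso3' column built from df[name_col].

    `convert` is any callable that maps one country name to a code.
    """
    out = df.copy()
    # Convert each distinct name once, then broadcast with map
    codes = {name: convert(name) for name in out[name_col].dropna().unique()}
    out['iso3'] = out[name_col].map(codes)
    return out

# Toy converter standing in for a real library call
toy_codes = {'Brazil': 'BRA', 'Germany': 'DEU'}
toy = pd.DataFrame({'country': ['Brazil', 'Germany', 'Brazil', None]})
result = add_iso3(toy, 'country', toy_codes.get)
print(result)
```

In this notebook, `convert` could be `lambda name: coco.convert(names = name, to = 'ISO3')`, which is the call used in the function defined in this section.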

In [None]:
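# Country name converter (also imported further below in this notebook)
import country_converter as coco

# Define function to convert country name into ISO3
def country_iso3(df_raw, df_type = 'survey'):
    '''This function creates a column with ISO3 country
    values'''
    # Check if df_type is valid
    if df_type not in ('survey', 'map'):
        raise ValueError("df_type must be 'survey' or 'map'")
    # Copy df_raw
    df = df_raw.copy()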
    # If survey is passed
    if df_type == 'survey':
        # Loop over row values
        for i in tqdm(df.index):
            # Define country value
            country = str(df.loc[i, 'country'])
            # Convert to ISO3
            df.loc[i, 'iso3'] = coco.convert(names = country, to = 'ISO3')
    # If map is passed
    elif df_type == 'map':
        # Loop over row values
        for i in tqdm(df.index):
            # Define country value
            country = str(df.loc[i, 'CNTRY_NAME'])
            # Convert to ISO3
            df.loc[i, 'iso3'] = coco.convert(names = country, to = 'ISO3')
    # Return dataframe
    return(df)

# Convert surveys into ISO3 
survey_2020 = country_iso3(survey_2020, df_type = 'survey')
survey_2019 = country_iso3(survey_2019, df_type = 'survey')
survey_2018 = country_iso3(survey_2018, df_type = 'survey')
survey_2017 = country_iso3(survey_2017, df_type = 'survey')
survey_2016 = country_iso3(survey_2016, df_type = 'survey')

# Convert map into ISO3
map_df = country_iso3(map_df, df_type = 'map')

In [41]:
# Make list of unique country names
country_list_2020 = survey_2020['country'].unique().tolist()
country_list_2019 = survey_2019['country'].unique().tolist()
country_list_2018 = survey_2018['country'].unique().tolist()
country_list_2017 = survey_2017['country'].unique().tolist()
country_list_2016 = survey_2016['country'].unique().tolist()
country_list_map = map_df['CNTRY_NAME'].unique().tolist()

# Assumption: `countries` is pycountry's country database,
# which provides the search_fuzzy method used below
from pycountry import countries

# Define function to retrieve non-matches
def no_match_numeric(country_list):
    '''This function tries to match countries in country list and
    returns list with non-matched values to be reviewed'''
    no_match = []
    for country in country_list:
        try:
            countries.search_fuzzy(str(country))[0].numeric
        # Any failed lookup counts as a non-match
        except Exception:
            no_match.append(country)
    return(no_match)
    
# Get non-matched lists
no_match_2020 = no_match_numeric(country_list_2020)
no_match_2019 = no_match_numeric(country_list_2019)
no_match_2018 = no_match_numeric(country_list_2018)
no_match_2017 = no_match_numeric(country_list_2017)
no_match_2016 = no_match_numeric(country_list_2016)

Now that we know which countries are not being matched, we can edit their names to get an exact match.
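
Replacing names by their position in the non-match list depends on the order in which the names happen to appear. A more explicit alternative is a dictionary from raw spelling to corrected name, sketched below. The raw spellings on the left are assumptions made for illustration; the targets on the right are names used in this notebook.

```python
import numpy as np
import pandas as pd

# ASSUMED raw spellings -> corrected names (np.nan where no country applies)
name_fixes = {
    'South Korea': 'Korea, Republic of',
    'Venezuela, Bolivarian Republic of...': 'Venezuela',
    'Nomadic': np.nan,
}

countries_raw = pd.Series(['South Korea', 'Brazil', 'Nomadic'])
# Names that are not in the dictionary are left unchanged
countries_fixed = countries_raw.replace(name_fixes)
print(countries_fixed.tolist())
```

With such a mapping, the same dictionary could be applied to every survey year, and a changed row order would not silently assign the wrong name.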

In [None]:
# Replace 2020 country names where possible and nan if not
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[0]), np.nan)
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[1]), 'Venezuela')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[2]), 'Hong Kong Special Administrative Region of China')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[3]), 'Korea, Republic of')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[4]), 'Congo, The Democratic Republic of the')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[5]), 'Macedonia')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[6]), 'Libya')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[7]), 'Republic of the Congo')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[8]), 'Eswatini')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[9]), 'Micronesia')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[10]), 'Korea, Democratic People\'s Republic of')
survey_2020['country'] = survey_2020['country'].replace(str(no_match_2020[11]), 'Cabo Verde')

# Replace 2019 country names where possible and nan if not
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[0]), 'Korea, Republic of')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[1]), 'Hong Kong Special Administrative Region of China')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[2]), 'Cabo Verde')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[3]), 'Libya')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[4]), 'Venezuela')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[5]), np.nan)
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[6]), 'Macedonia')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[7]), 'Congo, The Democratic Republic of the')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[8]), 'Korea, Democratic People\'s Republic of')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[9]), 'Eswatini')
survey_2019['country'] = survey_2019['country'].replace(str(no_match_2019[10]), 'Republic of the Congo')

# Replace 2018 country names where possible and nan if not
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[0]), 'Iran, Islamic Republic of')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[1]), 'Hong Kong Special Administrative Region of China')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[2]), 'Korea, Republic of')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[3]), 'Venezuela')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[4]), np.nan)
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[5]), 'Macedonia')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[6]), 'Micronesia')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[7]), 'Eswatini')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[8]), 'Libya')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[9]), 'Congo, The Democratic Republic of the')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[10]), 'Republic of the Congo')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[11]), 'Korea, Democratic People\'s Republic of')
survey_2018['country'] = survey_2018['country'].replace(str(no_match_2018[12]), 'Cabo Verde')

# Replace 2017 country names where possible and nan if not
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[0]), np.nan)
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[1]), 'Moldova')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[2]), 'Korea, Republic of')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[3]), 'Bosnia')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[4]), 'Netherlands')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[5]), 'Virgin Islands, U.S.')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[6]), 'Cabo Verde')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[7]), 'Korea, Democratic People\'s Republic of')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[8]), 'Azerbaijan')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[9]), 'South Georgia and the South Sandwich Islands')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[10]), 'Virgin Islands, British')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[11]), 'Réunion')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[12]), 'New Caledonia')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[13]), 'Lao')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[14]), 'Tajikistan')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[15]), 'Cote d\'Ivoire')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[16]), 'United States Minor Outlying Islands')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[17]), 'Polynesia')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[18]), 'France')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[19]), 'Pitcairn')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[20]), 'Eswatini')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[21]), 'Saint Vincent and the Grenadines')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[22]), 'Martinique')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[23]), 'Macao')
survey_2017['country'] = survey_2017['country'].replace(str(no_match_2017[24]), 'Heard Island and McDonald Islands')

# Replace 2016 country names where possible and nan if not
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[0]), 'Antigua and Barbuda')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[1]), 'Bosnia')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[2]), 'Ireland')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[3]), 'Cote d\'Ivoire')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[4]), 'Korea, Republic of')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[5]), 'Lao')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[6]), 'Myanmar')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[7]), np.nan)
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[8]), 'Sao Tome and Principe')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[9]), 'Korea, Democratic People\'s Republic of')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[10]), 'Saint Kitts and Nevis')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[11]), 'Trinidad and Tobago')
survey_2016['country'] = survey_2016['country'].replace(str(no_match_2016[12]), 'Timor-Leste')


Now that the countries have the names they need in order to be matched to a country code, we can create a new column that holds each country's code.
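
Looking up a code for every row repeats the same search many times, since the same country name appears again and again. One option, sketched below with the standard library only, is to memoize the lookup so each distinct name is searched once. `slow_lookup` is a toy stand-in for an expensive search, not a real library function.

```python
from functools import lru_cache

call_log = []  # records every time the slow lookup really runs

def slow_lookup(name):
    """Toy stand-in for an expensive fuzzy search."""
    call_log.append(name)
    return {'Brazil': 'BRA', 'Germany': 'DEU'}.get(name)

@lru_cache(maxsize=None)
def cached_lookup(name):
    # Results are remembered per distinct name
    return slow_lookup(name)

column = ['Brazil', 'Germany', 'Brazil', 'Brazil', 'Germany']
codes = [cached_lookup(name) for name in column]
print(codes, len(call_log))
```

Five rows trigger only two real lookups here; on a full survey the saving would depend on how many distinct country names it contains.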

In [None]:
%%time
survey_2020['country'].apply(lambda x: countries.search_fuzzy(str(x))[0].numeric)

In [None]:
import dask.dataframe as dd

survey_2020_dd = dd.from_pandas(survey_2020, npartitions=30)

In [None]:
%%time
survey_2020_dd['country'].apply(lambda x: countries.search_fuzzy(str(x))[0].alpha_3, meta = ('str')).compute()

In [None]:

pandarallel.initialize()
#start 3:54
survey_2020['country'].parallel_apply(lambda x: countries.search_fuzzy(str(x))[0].alpha_3)

In [None]:
# Define function to create numeric country code value in column
def country_numeric(df_raw):
    '''This function creates a column that contains the
    country's numeric code'''
    # Setup
    df = df_raw.copy()
    # Look up the numeric code from the country name
    df['country_code'] = df['country'].parallel_apply(lambda x: countries.search_fuzzy(str(x))[0].numeric)
    # Return new dataframe
    return(df)

In [None]:
no_match_2016

In [None]:
# no_match_2018 - Cape Verde

In [None]:
# no_match_2016 - East Timor

With that in hand, we can add a variable to each dataset to mark the year it represents and then merge them.
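
The tagging and stacking can be pictured on two toy frames. This is only an illustration of the pattern, with made-up data; `DataFrame.assign` returns a tagged copy, so the original frames are left untouched.

```python
import pandas as pd

# Toy stand-ins for the yearly survey frames
frames = {
    2019: pd.DataFrame({'country': ['Brazil', 'Germany']}),
    2020: pd.DataFrame({'country': ['India']}),
}

# Tag each frame with its year, then stack them into one frame
combined = pd.concat(
    [frame.assign(year=year) for year, frame in frames.items()],
    ignore_index=True,
)
print(combined)
```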

In [None]:
# Add year variable to dataframes
survey_2020['year'] = 2020
survey_2019['year'] = 2019
survey_2018['year'] = 2018
survey_2017['year'] = 2017
survey_2016['year'] = 2016

# Merge datasets into one
data = [survey_2020, survey_2019, survey_2018, survey_2017, survey_2016]
survey = pd.concat(data, ignore_index = True)

<a name="model"></a>
## 4. Data modeling

Now that we cleaned and organized our data, we can proceed to answer the questions proposed in [Section 1.](#business).

> What are the most popular programming languages over the past five years?\
> What countries have more female respondents?\
> Are male respondents happier than female respondents?\
> What countries have the highest job satisfaction rates?

### What are the most popular programming languages over the past five years?

Our first question requires us to look at which languages the respondents said they knew how to use and to analyze how this has changed over the years.

### What countries have more female respondents?

A big problem in tech (and many other industries) is the barrier many women face when trying to enter the industry. We can look at how the composition of respondents to the Stack Overflow Annual Developer Survey has changed to get an idea of whether more women are participating in the most important forum for programmers.

In [None]:
# Define survey with average of gender categories
df_gender = survey[['year', 'gender_male', 'gender_female', 'gender_other', 'gender_null']].groupby('year', as_index = False).mean()

# Print head
df_gender.head()

Above we can see a table with the composition of respondents by gender for the past five years. However, it might be easier to understand what is happening with a graph.

In [None]:
# Set figure size
plt.figure(figsize=(12,8))

# Define graph for each gender category
sns.lineplot(x = 'year', y = 'gender_female', data = df_gender, legend='brief', marker = 'o', label = 'Female')
sns.lineplot(x = 'year', y = 'gender_male', data = df_gender, legend='brief', marker = 'o', label = 'Male')
sns.lineplot(x = 'year', y = 'gender_other', data = df_gender, legend='brief', marker = 'o', label = 'Other')
sns.lineplot(x = 'year', y = 'gender_null', data = df_gender, legend='brief', marker = 'o', label = 'Not declared')

# Set details of plot
plt.title('Gender of respondents', fontsize = 16)
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Share of respondents", fontsize = 14)
plt.xticks(df_gender['year'])
plt.yticks([0,.2,.4, .6, .8, 1])
plt.gca().spines['bottom'].set_position(('data',0))
plt.legend(loc = 'center right', frameon = False)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Plot graph
plt.show();

It looks like women have consistently made up under 10% of the respondent pool. It is worth noting, however, that many people choose not to declare their gender. We might want to look at the shares of male, female and other only among those who chose to declare their gender.
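
The adjustment amounts to dividing each declared share by the sum of the declared shares in the same year, so that they add up to 1 again. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up shares per year, including respondents who did not declare a gender
shares = pd.DataFrame({
    'year': [2019, 2020],
    'gender_male': [0.80, 0.70],
    'gender_female': [0.08, 0.07],
    'gender_other': [0.02, 0.03],
    'gender_null': [0.10, 0.20],
})

declared = ['gender_male', 'gender_female', 'gender_other']
# Divide each row by its own total of declared shares
renormalized = shares[declared].div(shares[declared].sum(axis=1), axis=0)
adjusted = shares[['year']].join(renormalized)
print(adjusted)
```

For 2019 in this toy table, the female share moves from 0.08 of all respondents to 0.08 / 0.90 of those who declared a gender.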

In [None]:
# Define gender adjusted dataset (copy to avoid modifying a view of df_gender)
df_gender_adj = df_gender[['year', 'gender_male', 'gender_female', 'gender_other']].copy()

# Set sum of relevant variables
sum_gender = df_gender_adj[['gender_male', 'gender_female', 'gender_other']].sum(axis=1)

# Adjust categories by only those who declared their gender
df_gender_adj['gender_male'] = df_gender_adj['gender_male']/sum_gender
df_gender_adj['gender_female'] = df_gender_adj['gender_female']/sum_gender
df_gender_adj['gender_other'] = df_gender_adj['gender_other']/sum_gender

# Print adjusted gender distributions
df_gender_adj.head()

With this in hand, we can reproduce the graph from before.

In [None]:
# Set figure size
plt.figure(figsize=(12,8))

# Define graph for each gender category
sns.lineplot(x = 'year', y = 'gender_female', data = df_gender_adj, legend='brief', marker = 'o', label = 'Female')
sns.lineplot(x = 'year', y = 'gender_male', data = df_gender_adj, legend='brief', marker = 'o', label = 'Male')
sns.lineplot(x = 'year', y = 'gender_other', data = df_gender_adj, legend='brief', marker = 'o', label = 'Other')

# Set details of plot
plt.title('Gender of respondents', fontsize = 16)
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Share of respondents", fontsize = 14)
plt.xticks(df_gender['year'])
plt.yticks([0,.2,.4, .6, .8, 1])
plt.gca().spines['bottom'].set_position(('data',0))
plt.legend(loc = 'center right', frameon = False)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Plot graph
plt.show();

This is no surprise, as men already constituted the majority of respondents before the adjustment. This exercise does indicate, however, that there is a lack of participation of women (and other gender identities) in the Stack Overflow Annual Developer Survey, which could indicate a wider trend in the tech industry that needs to be addressed. Ideally, we would want a higher participation of women in the tech industry.

### Are male respondents happier than female respondents?

Seeing that the majority of survey respondents are men, we could check whether this translates into men having a higher job satisfaction than women and other gender identities.
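
To compare groups, the harmonized labels first need a numeric score. The sketch below shows one way to do that with a dictionary and `Series.map`, followed by a mean per year and gender. The toy data is made up, and the 1 to 5 scale is the one used in this section.

```python
import pandas as pd

# Numeric score per harmonized label (1 = lowest, 5 = highest)
SCORES = {'Very satisfied': 5, 'Satisfied': 4, 'Neither': 3,
          'Dissatisfied': 2, 'Very Dissatisfied': 1}

toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2020],
    'gender': ['Male', 'Male', 'Female', 'Female'],
    'job_satisfaction': ['Very satisfied', 'Neither', 'Satisfied', 'Dissatisfied'],
})

# Translate labels into scores and average within each year and gender
toy['satisfaction_score'] = toy['job_satisfaction'].map(SCORES)
mean_scores = toy.groupby(['year', 'gender'])['satisfaction_score'].mean()
print(mean_scores)
```

Averaging an ordinal scale treats the distance between neighbouring categories as equal, which is a simplification worth keeping in mind when reading the results.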

In [None]:
score_jobsat = survey[['job_satisfaction', 'gender', 'year']].copy()

In [None]:
score_jobsat['year'].unique()

In [None]:
# Create new satisfaction score variable
score_jobsat['satisfaction_score'] = np.nan

# Drop rows with null scores
score_jobsat = score_jobsat.dropna(subset = ['job_satisfaction'])

# Loop over rows to assign a numeric score
for i in tqdm(score_jobsat.index):
    if str(score_jobsat.loc[i, 'job_satisfaction']) == 'Very satisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 5
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Satisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 4
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Neither':
        score_jobsat.loc[i, 'satisfaction_score'] = 3
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Dissatisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 2
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Very Dissatisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 1


In [None]:
#df_jobsat = score_jobsat[['year', 'gender_male', 'gender_female', 'gender_other', 'gender_null']].groupby('year', as_index = False).mean()


In [None]:
# Define graph for satisfaction per gender over years
# (catplot creates its own figure, so the size is set via height and aspect)
g = sns.catplot(x = 'year', y = 'satisfaction_score', hue = 'gender', kind= 'bar', data = score_jobsat,
                height = 8, aspect = 1.5)
g._legend.set_title("Gender")
# Set details of plot
plt.title('Satisfaction of respondents by gender (2016 - 2020)', fontsize = 16)
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Satisfaction score', fontsize = 14);

It doesn't look like there is a substantial difference in job satisfaction between genders over the years. Note, however, that happiness levels seem to have been slightly higher in 2016 compared to the other years.

### What countries have the highest job satisfaction rates?

<a name="eval"></a>
## 5. Evaluate the results

Text text

<a name="deploy"></a>
## 6. Deploy

Text text

In [None]:
os.getcwd()

In [None]:
import geopandas as gpd

In [None]:
map_df = gpd.read_file('IPUMSI_world_release2017/world_countries_2017.shp')

In [None]:
ax = map_df.plot()
ax.axis('off');

In [None]:
map_df.head()

In [None]:
countries = map_df['CNTRY_NAME'].unique().tolist()

In [None]:
'Bahamas' in countries

In [None]:
countries2 = survey_2020['Country'].unique().tolist()

In [None]:
import country_converter as coco

iso_lst1 = []
iso_lst2 = []

for country in countries:
    iso1 = coco.convert(names=country, to='ISO3')
    iso_lst1.append(iso1)

for country2 in countries2:
    iso2 = coco.convert(names=country2, to='ISO3')
    iso_lst2.append(iso2)


In [None]:
for i in iso_lst2:
    print(i, i in iso_lst1)

In [None]:
'US' in iso_lst1

In [None]:
coco.convert(names='United States of America', to='ISO3')

In [None]:
'USA' in iso_lst2

In [None]:
for i in survey_2020.index:
    survey_2020.loc[i, 'Country'] = coco.convert(names = str(survey_2020.loc[i, 'Country']), to = 'ISO3')

In [None]:
survey_2020.head()