# Title here

Description here

## Table of contents
- [1. Business understanding](#business)
- [2. Data understanding](#data)
    - [2.1. Gathering data](#gather)
    - [2.2. Assessing data](#assess)
- [3. Prepare data](#prepare)
- [4. Data modeling](#model)
- [5. Evaluate the results](#eval)
- [6. Deploy](#deploy)

<a name="business"></a>
## 1. Business understanding

In this notebook we will try to answer the following questions using data from the [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey).

> What are the most popular programming languages over the past five years?\
> Which countries have more female respondents?\
> Are male respondents happier than female respondents?\
> Which countries have the highest job satisfaction rates?

The following sections walk through the data and the steps needed to answer these questions.

<a name="data"></a>
## 2. Data understanding

We begin by downloading the data needed to address the questions laid out in [Section 1](#business). We will then take a first look at the data to get a sense of what needs to change later on.

<a name="gather"></a>
### 2.1. Gathering data

First, we need to download all the necessary data. To do so, we can run the cell below, which downloads the Stack Overflow surveys for all years along with the shapefiles:

In [None]:
# Download survey data
%run -i '../download/download.py'

# Download shape files
%run -i '../download/shape.py'

The download covers every survey since 2011, but we will only use the ones from the last five years (2016 to 2020). One reason is that the structure of the survey changed over time, so similar questions might no longer be comparable. Next, in preparation for the following sections, we import the relevant libraries.

In [1]:
# Import libraries
import country_converter as coco
import geopandas as gpd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pycountry import countries
from tqdm.auto import tqdm
%matplotlib inline

<a name="assess"></a>
### 2.2. Assessing data

Now that we have downloaded all the datasets, let's start by reading the CSVs from the past five years. In doing so, I am dropping the first column, as it only serves as an ordered identifier for the respondents.

In [2]:
# Suppress warnings to keep the notebook output clean
import warnings
warnings.simplefilter('ignore')

# Import survey data and skip first column
survey_2016 = pd.read_csv("../data/survey/survey_2016.csv").iloc[:, 1:]
survey_2017 = pd.read_csv("../data/survey/survey_2017.csv").iloc[:, 1:]
survey_2018 = pd.read_csv("../data/survey/survey_2018.csv").iloc[:, 1:]
survey_2019 = pd.read_csv("../data/survey/survey_2019.csv").iloc[:, 1:]
survey_2020 = pd.read_csv("../data/survey/survey_2020.csv").iloc[:, 1:]

# Import shapefile with geopandas
#map_df = gpd.read_file("../data/shapefile/world_countries_2017.shp")
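
Since the five file names differ only by year, the paths can also be built in a loop instead of being written out one by one. This is a minimal sketch, assuming the same ../data/survey/ layout used above; `DATA_DIR`, `YEARS` and `survey_paths` are names introduced here for illustration only.

```python
from pathlib import Path

# Assumption: same folder and file naming as in the cell above
DATA_DIR = Path("../data/survey")
YEARS = range(2016, 2021)  # 2016 up to and including 2020

# Map each survey year to the path of its CSV file
survey_paths = {year: DATA_DIR / f"survey_{year}.csv" for year in YEARS}

for year, path in survey_paths.items():
    print(year, path)
```

Each path could then be passed to `pd.read_csv(path).iloc[:, 1:]` to build a dictionary of dataframes keyed by year, which makes later per-year steps easier to loop over.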

Great! Now we can take a quick look at these datasets. I will do that by sampling two random rows from each survey, starting with 2020.

In [3]:
# Show dataframe for two random samples for 2020
pd.options.display.max_columns = None # to show all columns
survey_2020.sample(2)

Unnamed: 0,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,CurrencySymbol,DatabaseDesireNextYear,DatabaseWorkedWith,DevType,EdLevel,Employment,Ethnicity,Gender,JobFactors,JobSat,JobSeek,LanguageDesireNextYear,LanguageWorkedWith,MiscTechDesireNextYear,MiscTechWorkedWith,NEWCollabToolsDesireNextYear,NEWCollabToolsWorkedWith,NEWDevOps,NEWDevOpsImpt,NEWEdImpt,NEWJobHunt,NEWJobHuntResearch,NEWLearn,NEWOffTopic,NEWOnboardGood,NEWOtherComms,NEWOvertime,NEWPurchaseResearch,NEWPurpleLink,NEWSOSites,NEWStuck,OpSys,OrgSize,PlatformDesireNextYear,PlatformWorkedWith,PurchaseWhat,Sexuality,SOAccount,SOComm,SOPartFreq,SOVisitFreq,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
20622,I am a developer by profession,Yes,29.0,14,Yearly,,,India,Indian rupee,INR,,DynamoDB;Elasticsearch;MongoDB;MySQL;Redis,"Developer, full-stack","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,South Asian,Man,Flex time or a flexible schedule;Remote work o...,Slightly dissatisfied,I am actively looking for a job,HTML/CSS;JavaScript;Python,HTML/CSS;JavaScript;Python,Flutter;TensorFlow,Cordova;Node.js;React Native,Stack Overflow for Teams,Confluence;Jira;Github;Gitlab;Slack;Google Sui...,No,Extremely important,Not at all important/not necessary,Just because,"Read company media, such as employee blogs or ...",Every few months,No,No,No,Sometimes: 1-2 days per month but less than we...,Research companies that have advertised on sit...,"Hello, old friend",Stack Overflow (public Q&A for anyone who codes),Panic,MacOS,100 to 499 employees,Android;Kubernetes;Raspberry Pi,AWS;Docker;Google Cloud Platform;Heroku;WordPress,I have some influence,Straight / Heterosexual,Yes,"Yes, definitely",Daily or almost daily,Daily or almost daily,Difficult,Too long,No,"Another engineering discipline (such as civil,...",Vue.js,Angular.js;Express;jQuery;React.js,Somewhat more welcome now than last year,45.0,6,6
33673,I am a developer by profession,Yes,26.0,18,Monthly,35000.0,3672.0,Nepal,Nepalese rupee,NPR,DynamoDB;Elasticsearch;IBM DB2;Microsoft SQL S...,Firebase;Microsoft SQL Server;MongoDB;MySQL;Po...,Data scientist or machine learning specialist;...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,East Asian;South Asian;Southeast Asian,Man,"Languages, frameworks, and other technologies ...",Very satisfied,I am actively looking for a job,C#;Dart;HTML/CSS;JavaScript;Kotlin;SQL;TypeScript,C;C#;C++;HTML/CSS;Java;JavaScript;Python;SQL;T...,.NET;.NET Core;Node.js;TensorFlow;Unity 3D,.NET;.NET Core;Node.js;React Native;Unity 3D,Confluence;Jira;Github;Gitlab;Slack;Microsoft ...,Jira;Github;Gitlab;Slack;Trello,Yes,Neutral,Not at all important/not necessary,Curious about other opportunities;Wanting to w...,Personal network - friends or family,Every few months,Not sure,Yes,Yes,Occasionally: 1-2 days per quarter but less th...,Start a free trial;Ask developers I know/work ...,"Hello, old friend",Stack Overflow (public Q&A for anyone who code...,Call a coworker or friend;Visit Stack Overflow...,Windows,20 to 99 employees,AWS;Docker;Kubernetes;Microsoft Azure;Windows,Android;Docker;Heroku;Windows,I have some influence,Straight / Heterosexual,Yes,"Yes, somewhat",I have never participated in Q&A on Stack Over...,A few times per week,Easy,Too long,No,"Computer science, computer engineering, or sof...",Angular;ASP.NET Core,Angular;ASP.NET Core;Express;Flask;jQuery;Reac...,Just as welcome now as I felt last year,40.0,8,4


And for the remaining years we see:

In [4]:
# Random sample for 2019
survey_2019.sample(2)

Unnamed: 0,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,OrgSize,DevType,YearsCode,Age1stCode,YearsCodePro,CareerSat,JobSat,MgrIdiot,MgrMoney,MgrWant,JobSeek,LastHireDate,LastInt,FizzBuzz,JobFactors,ResumeUpdate,CurrencySymbol,CurrencyDesc,CompTotal,CompFreq,ConvertedComp,WorkWeekHrs,WorkPlan,WorkChallenge,WorkRemote,WorkLoc,ImpSyn,CodeRev,CodeRevHrs,UnitTests,PurchaseHow,PurchaseWhat,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,WebFrameWorkedWith,WebFrameDesireNextYear,MiscTechWorkedWith,MiscTechDesireNextYear,DevEnviron,OpSys,Containers,BlockchainOrg,BlockchainIs,BetterLife,ITperson,OffOn,SocialMedia,Extraversion,ScreenName,SOVisit1st,SOVisitFreq,SOVisitTo,SOFindAnswer,SOTimeSaved,SOHowMuchTime,SOAccount,SOPartFreq,SOJobs,EntTeams,SOComm,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
60044,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of LOWER quality than prop...",Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Participated in a full-time developer training...,"10,000 or more employees","Developer, back-end;Developer, desktop or ente...",35,9,26,Very satisfied,Very satisfied,Very confident,Not sure,No,I am not interested in new job opportunities,More than 4 years ago,Write any code,No,"Languages, frameworks, and other technologies ...","My job status changed (promotion, new job, etc.)",USD,United States dollar,104000.0,Yearly,104000.0,40.0,There's no schedule or spec; I work on what se...,Lack of support from management;Not enough peo...,"Less than half the time, but at least one day ...",Office,A little above average,"Yes, because I see value in code review",1.0,"Yes, it's part of our process",Not sure,I have little or no influence,C#;SQL,C++;C#;Java;JavaScript;SQL;TypeScript;WebAssembly,DynamoDB;Microsoft SQL Server,Microsoft SQL Server,Android;Windows,Android;Windows,,,.NET;.NET Core;Node.js;Xamarin,.NET;.NET Core;Unity 3D,Eclipse;Notepad++;Visual Studio;Visual Studio ...,Windows,I do not use containers,,"Useful for decentralized currency (i.e., Bitcoin)",Yes,Also Yes,Yes,Facebook,Online,Username,2009,Daily or almost daily,Find answers to specific questions;Contribute ...,1-2 times per week,Stack Overflow was much faster,60+ minutes,Yes,A few times per month or weekly,Yes,"No, and I don't know what those are","Yes, definitely",Just as welcome now as I felt last year,,44.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Too long,Neither easy nor difficult
87032,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Sri Lanka,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,10 to 19 employees,"Developer, back-end;Developer, front-end;Devel...",4,20,1,Very satisfied,Very satisfied,Very confident,Yes,Yes,"I’m not actively looking, but I am open to new...",Less than a year ago,Write any code;Solve a brain-teaser style puzz...,No,Remote work options;Opportunities for professi...,"Something else changed (education, award, medi...",LKR,Sri Lankan rupee,40000.0,Monthly,2712.0,40.0,There is a schedule and/or spec (made by me or...,Time spent commuting,Less than once per month / Never,"Other place, such as a coworking space or cafe",Average,"Yes, because I see value in code review",4.0,"Yes, it's not part of our process but the deve...",Not sure,I have little or no influence,C#;HTML/CSS;Java;JavaScript;PHP;SQL,C#;HTML/CSS;JavaScript;SQL;Swift,Microsoft SQL Server;MySQL;SQLite,Microsoft SQL Server;SQLite,Android;AWS;Microsoft Azure;WordPress,Android;AWS,Angular/Angular.js;ASP.NET,Angular/Angular.js;ASP.NET,.NET;.NET Core;Node.js;Xamarin,Flutter;.NET;.NET Core;Xamarin,Android Studio;Atom;Visual Studio,Windows,I do not use containers,,Useful across many domains and could change ma...,Yes,Yes,No,Facebook,Neither,Login,2014,Daily or almost daily,Find answers to specific questions;Contribute ...,1-2 times per week,Stack Overflow was much faster,11-30 minutes,Yes,Daily or almost daily,"No, I knew that Stack Overflow had a job board...","No, I've heard of them, but I am not part of a...","Yes, definitely",Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,24.0,Man,No,,South Asian,No,Appropriate in length,Neither easy nor difficult


In [5]:
# Random sample for 2018
survey_2018.sample(2)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,AssessJob1,AssessJob2,AssessJob3,AssessJob4,AssessJob5,AssessJob6,AssessJob7,AssessJob8,AssessJob9,AssessJob10,AssessBenefits1,AssessBenefits2,AssessBenefits3,AssessBenefits4,AssessBenefits5,AssessBenefits6,AssessBenefits7,AssessBenefits8,AssessBenefits9,AssessBenefits10,AssessBenefits11,JobContactPriorities1,JobContactPriorities2,JobContactPriorities3,JobContactPriorities4,JobContactPriorities5,JobEmailPriorities1,JobEmailPriorities2,JobEmailPriorities3,JobEmailPriorities4,JobEmailPriorities5,JobEmailPriorities6,JobEmailPriorities7,UpdateCV,Currency,Salary,SalaryType,ConvertedSalary,CurrencySymbol,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AdsPriorities1,AdsPriorities2,AdsPriorities3,AdsPriorities4,AdsPriorities5,AdsPriorities6,AdsPriorities7,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
92818,Yes,No,Germany,"Yes, full-time","Not employed, but looking for work",Some college/university study without earning ...,"Computer science, computer engineering, or sof...",,Student,3-5 years,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10 (Very Likely),Daily or almost daily,Yes,Less than once per month or monthly,Yes,Yes,8,I'm not sure,,,,,,,,,,,,,,,,,,,,
65518,Yes,Yes,Netherlands,No,Employed full-time,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","1,000 to 4,999 employees",Back-end developer;Full-stack developer,21-23 years,18-20 years,Moderately satisfied,Moderately satisfied,Doing the same work,"I’m not actively looking, but I am open to new...",More than 4 years ago,10.0,7.0,9.0,3.0,4.0,2.0,6.0,1.0,8.0,5.0,1.0,8.0,5.0,6.0,9.0,3.0,11.0,10.0,7.0,4.0,2.0,5.0,3.0,4.0,1.0,2.0,1.0,7.0,3.0,5.0,2.0,6.0,4.0,My job status or other personal status changed,Euros (€),4800.0,Monthly,70500.0,EUR,"Jira;Slack;Other wiki tool (Github, Google Sit...",One to three months,Taken an online course in programming or softw...,The official documentation and/or standards fo...,,Because I find it enjoyable,Agree,Neither Agree nor Disagree,Strongly disagree,C++;C#;PHP;SQL;Delphi/Object Pascal,C++;C#;PHP;SQL;Delphi/Object Pascal,MySQL;Oracle,PostgreSQL;Oracle;Neo4j,Arduino;AWS;Raspberry Pi;Windows Desktop or Se...,Arduino;AWS;Windows Desktop or Server,,,Notepad++;PHPStorm;Visual Studio,Windows,2.0,Agile;Pair programming;Scrum,Git,Once a day,No,,,Somewhat agree,Somewhat agree,Somewhat agree,Stopped going to a website because of their ad...,2.0,1.0,3.0,4.0,7.0,6.0,5.0,Artificial intelligence surpassing human intel...,"Evolving definitions of ""fairness"" in algorith...",A governmental or other regulatory body,I'm excited about the possibilities more than ...,No,Depends on what it is,Upper management at the company/organization,Yes,9,Multiple times per day,Yes,Multiple times per day,Yes,Yes,6,Yes,Not at all interested,Not at all interested,Not at all interested,A little bit interested,Not at all interested,Between 7:01 - 8:00 AM,9 - 12 hours,1 - 2 hours,Never,Standing desk;Ergonomic keyboard or mouse,I don't typically exercise,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Very 
easy


In [6]:
# Random sample for 2017
survey_2017.sample(2)

Unnamed: 0,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,CompanyType,YearsProgram,YearsCodedJob,YearsCodedJobPast,DeveloperType,WebDeveloperType,MobileDeveloperType,NonDeveloperType,CareerSatisfaction,JobSatisfaction,ExCoderReturn,ExCoderNotForMe,ExCoderBalance,ExCoder10Years,ExCoderBelonged,ExCoderSkills,ExCoderWillNotCode,ExCoderActive,PronounceGIF,ProblemSolving,BuildingThings,LearningNewTech,BoringDetails,JobSecurity,DiversityImportant,AnnoyingUI,FriendsDevelopers,RightWrongWay,UnderstandComputers,SeriousWork,InvestTimeTools,WorkPayCare,KinshipDevelopers,ChallengeMyself,CompetePeers,ChangeWorld,JobSeekingStatus,HoursPerWeek,LastNewJob,AssessJobIndustry,AssessJobRole,AssessJobExp,AssessJobDept,AssessJobTech,AssessJobProjects,AssessJobCompensation,AssessJobOffice,AssessJobCommute,AssessJobRemote,AssessJobLeaders,AssessJobProfDevel,AssessJobDiversity,AssessJobProduct,AssessJobFinances,ImportantBenefits,ClickyKeys,JobProfile,ResumePrompted,LearnedHiring,ImportantHiringAlgorithms,ImportantHiringTechExp,ImportantHiringCommunication,ImportantHiringOpenSource,ImportantHiringPMExp,ImportantHiringCompanies,ImportantHiringTitles,ImportantHiringEducation,ImportantHiringRep,ImportantHiringGettingThingsDone,Currency,Overpaid,TabsSpaces,EducationImportant,EducationTypes,SelfTaughtTypes,TimeAfterBootcamp,CousinEducation,WorkStart,HaveWorkedLanguage,WantWorkLanguage,HaveWorkedFramework,WantWorkFramework,HaveWorkedDatabase,WantWorkDatabase,HaveWorkedPlatform,WantWorkPlatform,IDE,AuditoryEnvironment,Methodology,VersionControl,CheckInCode,ShipIt,OtherPeoplesCode,ProjectManagement,EnjoyDebugging,InTheZone,DifficultCommunication,CollaborateRemote,MetricAssess,EquipmentSatisfiedMonitors,EquipmentSatisfiedCPU,EquipmentSatisfiedRAM,EquipmentSatisfiedStorage,EquipmentSatisfiedRW,InfluenceInternet,InfluenceWorkstation,InfluenceHardware,InfluenceServers,InfluenceTechStack,InfluenceDeptTech,InfluenceVizTools,InfluenceData
base,InfluenceCloud,InfluenceConsultants,InfluenceRecruitment,InfluenceCommunication,StackOverflowDescribes,StackOverflowSatisfaction,StackOverflowDevices,StackOverflowFoundAnswer,StackOverflowCopiedCode,StackOverflowJobListing,StackOverflowCompanyPage,StackOverflowJobSearch,StackOverflowNewQuestion,StackOverflowAnswer,StackOverflowMetaChat,StackOverflowAdsRelevant,StackOverflowAdsDistracting,StackOverflowModeration,StackOverflowCommunity,StackOverflowHelpful,StackOverflowBetter,StackOverflowWhatDo,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
26635,Professional developer,"Yes, I program as a hobby",United Kingdom,No,Employed full-time,Bachelor's degree,A non-computer-focused engineering discipline,Never,500 to 999 employees,"Privately-held limited company, not in startup...",20 or more years,20 or more years,,Desktop applications developer,,,,10.0,9.0,,,,,,,,,"With a hard ""g,"" like ""gift""",,,,,,,,,,,,,,,,,,I am not interested in new job opportunities,,Less than a year ago,,,,,,,,,,,,,,,,Vacation/days off; Equipment; Expected work ho...,No,CW_Jobs; Dice; JobSite.co.uk; Monster; Reed.co...,I received negative feedback on my job perform...,A general-purpose job board,Important,Very important,Very important,Somewhat important,Not very important,Not very important,Somewhat important,Somewhat important,Somewhat important,Very important,British pounds sterling (£),,Spaces,Important,On-the-job training; Self-taught; Open source ...,Stack Overflow Q&A; Non-Stack online communities,,,10:00 AM,C++; C#; JavaScript; TypeScript,F#; Java,AngularJS; .NET Core,AngularJS; Xamarin; .NET Core,MongoDB; SQL Server; SQLite,MongoDB; SQL Server; SQLite,Windows Desktop; Arduino; Raspberry Pi,Raspberry Pi; Amazon Web Services (AWS),Visual Studio; Visual Studio Code,Keep the room absolutely quiet,Agile; Scrum; Extreme; Pair; Kanban,Git,Multiple times a day,Strongly agree,Strongly disagree,Agree,Strongly agree,Strongly agree,Strongly disagree,Strongly agree,Peers' rating,,,,,,,,,,,,,,,,,,"I've visited Stack Overflow, but haven't logge...",10.0,Desktop,At least once each day,At least once each day,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Agree,Strongly disagree,Strongly disagree,Strongly agree,Strongly agree,Strongly agree,Strongly agree,Disagree,Male,High school,White or of European descent,Agree,Agree,Strongly disagree,Strongly agree,,
46240,Student,"Yes, I program as a hobby",Belgium,"Yes, full-time","Independent contractor, freelancer, or self-em...",Bachelor's degree,Mathematics or statistics,Never,,,5 to 6 years,2 to 3 years,,,,,Other,7.0,5.0,,,,,,,,,"With a hard ""g,"" like ""gift""",Agree,Strongly agree,Strongly agree,Somewhat agree,Agree,Somewhat agree,Agree,Strongly disagree,Agree,Disagree,Agree,Agree,Somewhat agree,Agree,Somewhat agree,Agree,Agree,"I'm not actively looking, but I am open to new...",0.0,Between 1 and 2 years ago,Not very important,Important,Important,Somewhat important,Very important,Somewhat important,Important,Very important,Very important,Somewhat important,Somewhat important,Very important,Not very important,Not at all important,Somewhat important,Annual bonus; Vacation/days off; Health benefi...,Yes,,,,Somewhat important,Somewhat important,Somewhat important,Not very important,Not at all important,Not very important,Not very important,Somewhat important,Not at all important,Somewhat important,Euros (€),,Both,,Online course; Self-taught,Official documentation,,Take online courses; Buy books and work throug...,10:00 AM,C++; Haskell; Java; JavaScript; Matlab; Python,C; C++; Haskell; Java; JavaScript; Matlab; Python,,React,,,Windows Desktop,Android; iOS; Windows Desktop; Linux Desktop; ...,Vim; IntelliJ; IPython / Jupyter; Visual Studio,"Put on some ambient sounds (e.g. 
whale songs, ...",Pair,Git,A few times a week,Disagree,Somewhat agree,Somewhat agree,Disagree,Strongly agree,Disagree,Somewhat agree,Customer satisfaction; Benchmarked product per...,Somewhat satisfied,Not very satisfied,Not very satisfied,Not at all satisfied,Not at all satisfied,Somewhat satisfied,,,,,,,,,,,,"I have a login for Stack Overflow, but haven't...",9.0,Desktop; iOS browser,Several times,Haven't done at all,Once or twice,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Somewhat agree,Disagree,Disagree,Disagree,Agree,Agree,Disagree,Disagree,Male,A master's degree,White or of European descent,Somewhat agree,Agree,Strongly disagree,Strongly agree,,32258.064516


In [7]:
# Random sample for 2016
survey_2016.sample(2)

Unnamed: 0,collector,country,un_subregion,so_region,age_range,age_midpoint,gender,self_identification,occupation,occupation_group,experience_range,experience_midpoint,salary_range,salary_midpoint,big_mac_index,tech_do,tech_want,aliens,programming_ability,employment_status,industry,company_size_range,team_size_range,women_on_team,remote,job_satisfaction,job_discovery,dev_environment,commit_frequency,hobby,dogs_vs_cats,desktop_os,unit_testing,rep_range,visit_frequency,why_learn_new_tech,education,open_to_new_job,new_job_value,job_search_annoyance,interview_likelihood,how_to_improve_interview_process,star_wars_vs_star_trek,agree_tech,agree_notice,agree_problemsolving,agree_diversity,agree_adblocker,agree_alcohol,agree_loveboss,agree_nightcode,agree_legacy,agree_mars,important_variety,important_control,important_sameend,important_newtech,important_buildnew,important_buildexisting,important_promotion,important_companymission,important_wfh,important_ownoffice,developer_challenges,why_stack_overflow
15731,House ads,Slovakia,Eastern Europe,Eastern Europe,20-24,22.0,Male,Developer; Programmer; Manager; Ninja,other,,2 - 5 years,3.5,Unemployed,,,Android; C; iOS; Java,Android; iOS; Java,Other (please specify),9.0,I'm a student,,,,,,,,Eclipse,A couple times a week,2-5 hours per week,Other (please specify),Windows 10,I don't know,I don't have an account,Once a week,To build a specific product I have in mind,I'm self-taught; Mentorship program (e.g. Flat...,"I'm not actively looking, but I am open to new...",Salary; Work/life balance; Flexible work hours,The interview process,50%,Fewer brainteasers,Star Wars,Agree completely,Agree somewhat,Agree somewhat,Neutral,Disagree completely,Disagree completely,Disagree completely,Neutral,Disagree completely,Disagree completely,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,This is very important,This is somewhat important,This is very important,This is somewhat important,This is somewhat important,This is somewhat important,Poor scheduling; Non-technical management; Int...,To receive help on my personal projects
33799,Meta Stack Overflow Post,Slovakia,Eastern Europe,Eastern Europe,25-29,27.0,Male,Developer; Engineer; Programmer,Mobile developer - Android,"Mobile Dev (Android, iOS, WP & Multi-Platform)",2 - 5 years,3.5,"$20,000 - $30,000",25000.0,,Android; iOS; Java; Swift,Android; iOS; Java; Swift,No,8.0,Employed full-time,Defense,"1,000-4,999 employees",5-9 people,0.0,Never,I'm somewhat satisfied with my job,I knew I wanted to work here. I sought out the...,Sublime; IntelliJ,Multiple times a day,5-10 hours per week,Dogs,Windows 7,I don't know,"1,001 - 5,000",Multiple times a day,To keep my skills up to date,B.S. in Computer Science (or related field); M...,"I'm not actively looking, but I am open to new...",Salary; Ability to make or influence important...,Finding time to search for a job,50%,Offer remote interviews (e.g. via video confer...,Star Wars,Agree completely,Agree somewhat,Agree completely,Agree somewhat,Agree somewhat,Disagree somewhat,Agree somewhat,Agree somewhat,,Neutral,This is somewhat important,This is very important,I don't care about this,This is very important,This is somewhat important,This is somewhat important,This is somewhat important,This is somewhat important,I don't care about this,I don't care about this,Poor scheduling; Limited resources,To get help for my job; To give help to others...


Now that we have a better sense of what the data looks like, we can pick the columns needed for the analysis, given the questions outlined in [Section 1](#business).

The columns that indicate which languages respondents have worked with are the following:
> Survey 2020: LanguageWorkedWith \
> Survey 2019: LanguageWorkedWith \
> Survey 2018: LanguageWorkedWith \
> Survey 2017: HaveWorkedLanguage \
> Survey 2016: tech_do
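
Judging from the samples above, these columns store several entries in a single semicolon-separated string, with a space after the semicolon in 2016 and 2017 and none from 2018 onwards. The 2016 column also lists platforms such as Android and iOS next to languages, so it would need extra filtering. A minimal sketch of splitting and counting such answers, where `split_multi` and `count_items` are hypothetical helper names and the three example answers are copied from the samples above:

```python
from collections import Counter


def split_multi(cell):
    """Split one semicolon-separated answer into a list of clean items."""
    # Missing answers appear as NaN (a float) in pandas, so any non-string is treated as empty
    if not isinstance(cell, str):
        return []
    # strip() removes the space that follows the semicolon in the 2016 and 2017 files
    return [item.strip() for item in cell.split(";") if item.strip()]


def count_items(cells):
    """Count how often each item appears across many answers."""
    counts = Counter()
    for cell in cells:
        counts.update(split_multi(cell))
    return counts


answers = [
    "HTML/CSS;JavaScript;Python",       # 2020 style
    "C++; C#; JavaScript; TypeScript",  # 2017 style
    float("nan"),                       # missing answer
]
language_counts = count_items(answers)
print(language_counts.most_common(3))
```

The same helper could be applied to a whole column, for example with `count_items(survey_2020['LanguageWorkedWith'])`.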

The columns that indicate how satisfied a respondent is with their job are:
> Survey 2020: JobSat \
> Survey 2019: JobSat \
> Survey 2018: JobSatisfaction \
> Survey 2017: JobSatisfaction \
> Survey 2016: job_satisfaction

The columns that indicate education status are the following:
> Survey 2020: EdLevel \
> Survey 2019: EdLevel \
> Survey 2018: FormalEducation \
> Survey 2017: FormalEducation \
> Survey 2016: education

The columns that indicate where the respondent lives are:
> Survey 2020: Country \
> Survey 2019: Country \
> Survey 2018: Country \
> Survey 2017: Country \
> Survey 2016: country

The columns that indicate the respondent's gender are:
> Survey 2020: Gender \
> Survey 2019: Gender \
> Survey 2018: Gender \
> Survey 2017: Gender \
> Survey 2016: gender

The columns that indicate the respondent's employment status are:
> Survey 2020: Employment \
> Survey 2019: Employment \
> Survey 2018: Employment \
> Survey 2017: EmploymentStatus \
> Survey 2016: employment_status


Lastly, we can take a look at the shapefiles we imported:

In [None]:
# Head of shapefiles
map_df.head()

So the country names are stored in the variable CNTRY_NAME. We will need to match these with our data, and for that we will have to harmonize all the country names in [Section 3](#prepare).

Given all of the above we can proceed to prepare our data!

<a name="prepare"></a>
## 3. Prepare data

Since we already know which columns we need to answer our questions, we can start by dropping the columns that are not relevant to us:

In [8]:
# Put relevant variables in list
keep_2020 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2019 = ['LanguageWorkedWith', 'JobSat', 'EdLevel', 'Country', 'Gender', 'Employment']
keep_2018 = ['LanguageWorkedWith', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'Employment']
keep_2017 = ['HaveWorkedLanguage', 'JobSatisfaction', 'FormalEducation', 'Country', 'Gender', 'EmploymentStatus']
keep_2016 = ['tech_do', 'job_satisfaction', 'education', 'country', 'gender', 'employment_status']

# Keep only relevant variables
survey_2020 = survey_2020[keep_2020]
survey_2019 = survey_2019[keep_2019]
survey_2018 = survey_2018[keep_2018]
survey_2017 = survey_2017[keep_2017]
survey_2016 = survey_2016[keep_2016]

Nice! Now we can rename the columns so that all dataframes use the same variable names.

In [9]:
# Rename columns
survey_2020.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2019.rename(columns={'LanguageWorkedWith': 'languages', 'JobSat': 'job_satisfaction', 
                           'EdLevel': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2018.rename(columns={'LanguageWorkedWith': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'Employment': 'employment'}, inplace = True)
survey_2017.rename(columns={'HaveWorkedLanguage': 'languages', 'JobSatisfaction': 'job_satisfaction', 
                           'FormalEducation': 'education', 'Country': 'country', 'Gender': 'gender', 
                           'EmploymentStatus': 'employment'}, inplace = True)
survey_2016.rename(columns={'tech_do': 'languages', 'job_satisfaction': 'job_satisfaction', 
                           'education': 'education', 'country': 'country', 'gender': 'gender', 
                           'employment_status': 'employment'}, inplace = True);
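
The keep lists and rename calls above repeat the same information once per year. One way to keep it in a single place is a mapping from year to column names. The sketch below is only an illustration: `COLUMN_MAP`, `keep_columns` and `harmonized_names` are hypothetical names, and the column names are the ones listed in Section 2.2.

```python
# Original column name -> harmonized name, per survey year
COLUMN_MAP = {
    2020: {"LanguageWorkedWith": "languages", "JobSat": "job_satisfaction",
           "EdLevel": "education", "Country": "country",
           "Gender": "gender", "Employment": "employment"},
    2019: {"LanguageWorkedWith": "languages", "JobSat": "job_satisfaction",
           "EdLevel": "education", "Country": "country",
           "Gender": "gender", "Employment": "employment"},
    2018: {"LanguageWorkedWith": "languages", "JobSatisfaction": "job_satisfaction",
           "FormalEducation": "education", "Country": "country",
           "Gender": "gender", "Employment": "employment"},
    2017: {"HaveWorkedLanguage": "languages", "JobSatisfaction": "job_satisfaction",
           "FormalEducation": "education", "Country": "country",
           "Gender": "gender", "EmploymentStatus": "employment"},
    2016: {"tech_do": "languages", "job_satisfaction": "job_satisfaction",
           "education": "education", "country": "country",
           "gender": "gender", "employment_status": "employment"},
}


def keep_columns(year):
    """Original column names to keep for a given survey year."""
    return list(COLUMN_MAP[year])


def harmonized_names(year):
    """Harmonized column names a given survey year ends up with."""
    return sorted(COLUMN_MAP[year].values())


# Sanity check: every year must end up with the same set of column names
assert all(harmonized_names(y) == harmonized_names(2020) for y in COLUMN_MAP)
```

With pandas, an expression such as `survey_2017[keep_columns(2017)].rename(columns=COLUMN_MAP[2017])` would select and rename in one step and return a new dataframe instead of modifying a slice in place.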

Now we need to harmonize the answers to the different questions across all survey years so that we can merge them into one complete data set. Let's start with an easy one and look at the gender categories in each year.

In [10]:
# Print unique gender categories in 2020
survey_2020['gender'].unique()

array(['Man', nan, 'Woman',
       'Man;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man'], dtype=object)

In [11]:
# Print unique gender categories in 2019
survey_2019['gender'].unique()

array(['Man', nan, 'Woman',
       'Non-binary, genderqueer, or gender non-conforming',
       'Woman;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man;Non-binary, genderqueer, or gender non-conforming',
       'Woman;Man',
       'Man;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [12]:
# Print unique gender categories in 2018
survey_2018['gender'].unique()

array(['Male', nan, 'Female',
       'Female;Male;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Male',
       'Male;Non-binary, genderqueer, or gender non-conforming',
       'Non-binary, genderqueer, or gender non-conforming', 'Transgender',
       'Female;Transgender',
       'Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Female;Non-binary, genderqueer, or gender non-conforming',
       'Female;Transgender;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender', 'Female;Male;Transgender',
       'Female;Male;Non-binary, genderqueer, or gender non-conforming',
       'Male;Transgender;Non-binary, genderqueer, or gender non-conforming'],
      dtype=object)

In [13]:
# Print unique gender categories in 2017
survey_2017['gender'].unique()

array(['Male', nan, 'Female', 'Gender non-conforming', 'Other',
       'Male; Gender non-conforming', 'Female; Transgender',
       'Male; Female', 'Male; Other', 'Transgender',
       'Transgender; Gender non-conforming',
       'Female; Gender non-conforming',
       'Male; Female; Transgender; Gender non-conforming; Other',
       'Male; Female; Transgender', 'Male; Female; Other',
       'Male; Female; Transgender; Gender non-conforming',
       'Male; Transgender', 'Female; Transgender; Gender non-conforming',
       'Gender non-conforming; Other',
       'Male; Female; Gender non-conforming', 'Female; Other',
       'Male; Transgender; Gender non-conforming', 'Transgender; Other',
       'Male; Gender non-conforming; Other',
       'Female; Gender non-conforming; Other',
       'Male; Female; Gender non-conforming; Other',
       'Female; Transgender; Other',
       'Female; Transgender; Gender non-conforming; Other',
       'Male; Transgender; Other', 'Male; Female; Transgender;

In [14]:
# Print unique gender categories in 2016
survey_2016['gender'].unique()

array(['Male', nan, 'Female', 'Prefer not to disclose', 'Other'],
      dtype=object)

Given what we see above, let's group all answers into the following four categories: Male, Female, Other and nan. We can define a function that assigns one of these values to each respondent.

In [None]:
# Define function to harmonize gender
def harmonize_gender(df_raw):
    '''This function unifies all gender categories into 
    four: Male, Female, Other and nan. It also creates 
    binary variables for each of the above categories.
    '''
    # Copy df_raw
    df = df_raw.copy()
    # Binary variable for categories
    df['gender_male'] = 0
    df['gender_female'] = 0
    df['gender_other'] = 0
    df['gender_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define gender
        gender = str(df.loc[i, 'gender']).lower()
        # Value if male or man
        if gender == 'male' or gender == 'man':
            df.loc[i, 'gender'] = 'Male'
            df.loc[i, 'gender_male'] = 1
        # Value if female or woman
        elif gender == 'female' or gender == 'woman':
            df.loc[i, 'gender'] = 'Female'
            df.loc[i, 'gender_female'] = 1
        # Assign null values
        elif gender == 'nan':
            df.loc[i, 'gender'] = np.nan
            df.loc[i, 'gender_null'] = 1
        # Other categories lumped into other
        else:
            df.loc[i, 'gender'] = 'Other'
            df.loc[i, 'gender_other'] = 1
    # Return harmonized dataframe
    return(df)

# Apply gender harmonizer
survey_2020 = harmonize_gender(survey_2020)
survey_2019 = harmonize_gender(survey_2019)
survey_2018 = harmonize_gender(survey_2018)
survey_2017 = harmonize_gender(survey_2017)
survey_2016 = harmonize_gender(survey_2016)
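
The row-by-row loop above can be slow on surveys with tens of thousands of rows. As a minimal sketch, assuming the same four target categories and the same binary column names, the harmonization could also be written with vectorized pandas operations. The function name `harmonize_gender_vectorized` is ours and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def harmonize_gender_vectorized(df_raw, column='gender'):
    '''Vectorized sketch of the gender harmonizer (illustrative name).
    Produces Male, Female, Other or nan plus one binary column per category.'''
    df = df_raw.copy()
    raw = df[column]
    # Lower-case the answers; missing values are detected on the raw column
    lowered = raw.astype(str).str.lower()
    is_null = raw.isna()
    is_male = lowered.isin(['male', 'man']) & ~is_null
    is_female = lowered.isin(['female', 'woman']) & ~is_null
    is_other = ~(is_null | is_male | is_female)
    # Start from 'Other' and overwrite the rows that match a category
    harmonized = pd.Series('Other', index=df.index, dtype=object)
    harmonized.loc[is_male] = 'Male'
    harmonized.loc[is_female] = 'Female'
    harmonized.loc[is_null] = np.nan
    df[column] = harmonized
    # Binary variables for each category
    df['gender_male'] = is_male.astype(int)
    df['gender_female'] = is_female.astype(int)
    df['gender_other'] = is_other.astype(int)
    df['gender_null'] = is_null.astype(int)
    return df
```

It would be applied the same way as the loop version, for example `survey_2020 = harmonize_gender_vectorized(survey_2020)`.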

We can take a quick look at what the data looks like now:

In [None]:
survey_2020.head()

Similarly, for job satisfaction we can look at the possible values in each year:

In [None]:
# Print unique job satisfaction categories in 2020
survey_2020['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2019
survey_2019['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2018
survey_2018['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2017
survey_2017['job_satisfaction'].unique()

In [None]:
# Print unique job satisfaction categories in 2016
survey_2016['job_satisfaction'].unique()

We will lump all answers into six categories: Very satisfied, Satisfied, Neither, Dissatisfied, Very Dissatisfied and nan. As we did for gender, we define a function to do this.

In [None]:
trial_list = survey_2017['job_satisfaction'].unique().tolist()
for i in trial_list:
    print(str(i) == '9.0')

In [None]:
# Define function to harmonize job satisfaction
def harmonize_jobsatisfaction(df_raw):
    '''This function harmonizes all the job
    satisfaction responses into: Very satisfied,
    Satisfied, Neither, Dissatisfied, Very dissatisfied
    and nan. It also creates binary variables for each
    of the above categories.'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    v_satisfied = ['very satisfied', 'extremely satisfied', 'i love my job', '10.0', '9.0']
    satisfied = ['slightly satisfied', 'moderately satisfied', 'i\'m somewhat satisfied with my job', '8.0', '7.0']
    neither = ['neither satisfied nor dissatisfied', 'i\'m neither satisfied nor dissatisfied', '6.0', '5.0', '4.0']
    dissatisfied = ['slightly dissatisfied', 'moderately dissatisfied', 'i\'m somewhat dissatisfied with my job', '3.0', '2.0']
    v_dissatisfied = ['very dissatisfied', 'extremely dissatisfied', 'i hate my job', '1.0', '0.0']
    # New binary variables
    df['jobsat_v_satisfied'] = 0
    df['jobsat_satisfied'] = 0
    df['jobsat_neither'] = 0
    df['jobsat_dissatisfied'] = 0
    df['jobsat_v_dissatisfied'] = 0
    df['jobsat_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define job satisfaction
        job_satisfac = str(df.loc[i, 'job_satisfaction']).lower()
        # Value if very satisfied and assign binary variable
        if job_satisfac in v_satisfied:
            df.loc[i, 'job_satisfaction'] = 'Very satisfied'
            df.loc[i,'jobsat_v_satisfied'] = 1
        # Value if satisfied and assign binary variable
        elif job_satisfac in satisfied:
            df.loc[i, 'job_satisfaction'] = 'Satisfied'
            df.loc[i, 'jobsat_satisfied'] = 1
        # Value if neither and assign binary variable
        elif job_satisfac in neither:
            df.loc[i, 'job_satisfaction'] = 'Neither'
            df.loc[i, 'jobsat_neither'] = 1
        # Value if dissatisfied and assign binary variable
        elif job_satisfac in dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Dissatisfied'
            df.loc[i, 'jobsat_dissatisfied'] = 1
        # Value if very dissatisfied and assign binary variable
        elif job_satisfac in v_dissatisfied:
            df.loc[i, 'job_satisfaction'] = 'Very Dissatisfied'
            df.loc[i, 'jobsat_v_dissatisfied'] = 1
        # Other categories become np.nan values
        else:
            df.loc[i, 'job_satisfaction'] = np.nan
            df.loc[i, 'jobsat_null'] = 1
    # Return harmonized dataframe
    return(df)
    
    
# Apply job satisfaction harmonizer
survey_2020 = harmonize_jobsatisfaction(survey_2020)
survey_2019 = harmonize_jobsatisfaction(survey_2019)
survey_2018 = harmonize_jobsatisfaction(survey_2018)
survey_2017 = harmonize_jobsatisfaction(survey_2017)
survey_2016 = harmonize_jobsatisfaction(survey_2016)
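
Because the final `else` branch turns every unrecognized answer into `nan`, a misspelled or missing entry in the match lists would go unnoticed. A small check like the one below, run on the raw column before harmonizing, lists the answers that none of the match lists would catch. This is only a sketch: the helper name `unmapped_answers` is hypothetical, and `known_values` stands for the concatenation of the five lists defined inside the function above.

```python
import pandas as pd

def unmapped_answers(raw_answers, known_values):
    '''Count the raw answers (lower-cased) that are not in known_values.
    Missing values are ignored because they are expected to become nan.'''
    normalized = raw_answers.dropna().astype(str).str.lower()
    unmatched = normalized[~normalized.isin(list(known_values))]
    return unmatched.value_counts()
```

An empty result means every non-missing answer is covered by one of the lists.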

Again, we can check that this works by looking at a few rows:

In [None]:
survey_2017['job_satisfaction'].head()

Next, let's look at the employment variables and how they are laid out:

In [None]:
# Print unique employment categories in 2020
survey_2020['employment'].unique()

In [None]:
# Print unique employment categories in 2019
survey_2019['employment'].unique()

In [None]:
# Print unique employment categories in 2018
survey_2018['employment'].unique()

In [None]:
# Print unique employment categories in 2017
survey_2017['employment'].unique()

In [None]:
# Print unique employment categories in 2016
survey_2016['employment'].unique()

Now we have to deal with the responses for employment status. This one is a bit trickier, as the answer categories have changed over the years. With that in mind, let's create a function to harmonize these categories into the following: Full-time, Part-time, Self-employed, Not employed, Other and nan.

In [None]:
# Define function to harmonize employment categories
def harmonize_employment(df_raw):
    '''This function harmonizes all employment responses
    into: Full-time, Part-time, Self-employed, Not employed,
    Other and nan. It also creates binary variables for each
    of the above categories.'''
    # Copy df_raw
    df = df_raw.copy()
    # Values to match
    full_time = ['employed full-time']
    part_time = ['employed part-time']
    self_empl = ['independent contractor, freelancer, or self-employed', 'freelance / contractor', 'self-employed']
    not_employed = ['not employed, but looking for work', 'not employed, and not looking for work', 'unemployed']
    other = ['student', 'i\'m a student', 'retired', 'i prefer not to say', 'prefer not to disclose', 'other (please specify)']
    # New binary variables
    df['employment_full_time'] = 0
    df['employment_part_time'] = 0
    df['employment_self_empl'] = 0
    df['employment_not_empl'] = 0
    df['employment_other'] = 0
    df['employment_null'] = 0
    # Loop over rows
    for i in tqdm(df.index):
        # Define employment
        employment = str(df.loc[i, 'employment']).lower()
        # Value if full-time and assign binary variable
        if employment in full_time:
            df.loc[i, 'employment'] = 'Full-time'
            df.loc[i, 'employment_full_time'] = 1
        # Value if part-time and assign binary variable
        elif employment in part_time:
            df.loc[i, 'employment'] = 'Part-time'
            df.loc[i, 'employment_part_time'] = 1
        # Value if self-employed and assign binary variable
        elif employment in self_empl:
            df.loc[i, 'employment'] = 'Self-employed'
            df.loc[i, 'employment_self_empl'] = 1
        # Value if not employed and assign binary variable
        elif employment in not_employed:
            df.loc[i, 'employment'] = 'Not employed'
            df.loc[i, 'employment_not_empl'] = 1
        # Value if other and assign binary variable
        elif employment in other:
            df.loc[i, 'employment'] = 'Other'
            df.loc[i, 'employment_other'] = 1
        # Other categories become np.nan values
        else:
            df.loc[i, 'employment'] = np.nan
            df.loc[i, 'employment_null'] = 1
    # Return harmonized dataframe
    return(df)

# Apply employment harmonizer
survey_2020 = harmonize_employment(survey_2020)
survey_2019 = harmonize_employment(survey_2019)
survey_2018 = harmonize_employment(survey_2018)
survey_2017 = harmonize_employment(survey_2017)
survey_2016 = harmonize_employment(survey_2016)

Now let's look at education variables and their respective values.

In [None]:
# Print unique education categories in 2020
survey_2020['education'].unique()

In [None]:
# Print unique education categories in 2019
survey_2019['education'].unique()

In [None]:
# Print unique education categories in 2018
survey_2018['education'].unique()

In [None]:
# Print unique education categories in 2017
survey_2017['education'].unique()

In [None]:
# Print unique education categories in 2016
survey_2016['education'].unique().tolist()

The answers for 2016 look very different from those of the other years. This is probably because respondents were allowed to tick more than one box. We can start to untangle this by putting all possible options in a list called education_options.

In [None]:
# Put education categories into list
education_2016 = survey_2016['education'].unique().tolist()

# Create empty list for possible education options
education_options = []
# Loop over answers and append only unique values
for i in education_2016:
    for opt in str(i).split(';'): # Since options are separated by ;
        # Remove leading white space and append only unique values
        opt = opt.lstrip()
        if opt not in education_options:
            education_options.append(opt)

This gives us the following available options for respondents:

In [None]:
education_options

Now, we want to sort people into the following categories: Primary education, Secondary education, Some college, Bachelor's, Professional degree, Master's and Doctorate.
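
The notebook does not show this step, so the following is only a sketch of one way it could work. It assumes that the order in which the categories are listed above is a ranking from lowest to highest, and that a respondent who ticked several boxes should be assigned the highest one. The function name `highest_education` and the `option_to_category` dictionary are hypothetical; the dictionary has to be filled in by hand from the values in `education_options`.

```python
import numpy as np
import pandas as pd

# Target categories, assumed to be ordered from lowest to highest
CATEGORY_ORDER = ['Primary education', 'Secondary education', 'Some college',
                  "Bachelor's", 'Professional degree', "Master's", 'Doctorate']

def highest_education(answer, option_to_category, category_order=CATEGORY_ORDER):
    '''Return the highest-ranked category among the ;-separated options
    of one answer, or nan if no option can be mapped.'''
    if pd.isna(answer):
        return np.nan
    rank = {category: position for position, category in enumerate(category_order)}
    best = None
    for opt in str(answer).split(';'):
        # Look up the category for this option (None if it is not mapped)
        category = option_to_category.get(opt.strip())
        if category is not None and (best is None or rank[category] > rank[best]):
            best = category
    return best if best is not None else np.nan
```

With a filled-in dictionary, it could be applied with `survey_2016['education'].apply(highest_education, option_to_category=...)`.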

Lastly, to match the survey data with the geodata we imported, we need to harmonize the country names. To do that, we can use the [country_converter library](https://pypi.org/project/country-converter/). We define the following function and apply it to the country values:

In [None]:
# Define function to convert country name into ISO3
def country_iso3(df_raw, df_type = 'survey'):
    '''This function creates a column with ISO3
    country codes'''
    # Check that df_type is valid
    if df_type not in ('survey', 'map'):
        raise ValueError("df_type must be 'survey' or 'map'")
    # Copy df_raw
    df = df_raw.copy()
    # If survey is passed
    if df_type == 'survey':
        # Loop over row values
        for i in tqdm(df.index):
            # Define country value
            country = str(df.loc[i, 'country'])
            # Convert to ISO3
            df.loc[i, 'iso3'] = coco.convert(names = country, to = 'ISO3')
    # If map is passed
    elif df_type == 'map':
        # Loop over row values
        for i in tqdm(df.index):
            # Define country value
            country = str(df.loc[i, 'CNTRY_NAME'])
            # Convert to ISO3
            df.loc[i, 'iso3'] = coco.convert(names = country, to = 'ISO3')
    # Return dataframe
    return(df)

# Convert surveys into ISO3 
survey_2020 = country_iso3(survey_2020, df_type = 'survey')
survey_2019 = country_iso3(survey_2019, df_type = 'survey')
survey_2018 = country_iso3(survey_2018, df_type = 'survey')
survey_2017 = country_iso3(survey_2017, df_type = 'survey')
survey_2016 = country_iso3(survey_2016, df_type = 'survey')

# Convert map into ISO3
map_df = country_iso3(map_df, df_type = 'map')
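
Calling the converter once per row repeats the same lookup many times, since each survey contains far fewer distinct country names than rows. As a sketch, we could convert each unique name only once and map the result back onto the column. The helper name `add_iso3_column` and its `convert_one` argument are our own; `convert_one` is any function that turns one country name into a code, such as a thin wrapper around the `coco.convert` call used above.

```python
import pandas as pd

def add_iso3_column(df_raw, column, convert_one):
    '''Add an iso3 column by converting each unique value of column once.
    Missing country values stay missing instead of being converted.'''
    df = df_raw.copy()
    # Convert every distinct, non-missing name a single time
    unique_names = df[column].dropna().unique()
    lookup = {name: convert_one(str(name)) for name in unique_names}
    # Map the codes back onto all rows
    df['iso3'] = df[column].map(lookup)
    return df

# Possible usage with country_converter (same call as in country_iso3):
# survey_2020 = add_iso3_column(survey_2020, 'country',
#                               lambda name: coco.convert(names = name, to = 'ISO3'))
```

Note one difference from the loop version: rows with a missing country keep a missing `iso3` value here, whereas the loop passes the string `'nan'` to the converter.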

In [23]:
# Make list of unique country names
country_list_2020 = survey_2020['country'].unique().tolist()
country_list_2019 = survey_2019['country'].unique().tolist()
country_list_2018 = survey_2018['country'].unique().tolist()
country_list_2017 = survey_2017['country'].unique().tolist()
country_list_2016 = survey_2016['country'].unique().tolist()

# Define function to retrieve non-matches
def no_match_alpha3(country_list):
    '''This function tries to match countries in country list and
    returns list with non-matched values to be reviewed'''
    no_match = []
    for country in country_list:
        try:
            countries.search_fuzzy(str(country))[0].alpha_3
        except Exception:
            no_match.append(country)
    return(no_match)
    
# Get non-matched lists
no_match_2020 = no_match_alpha3(country_list_2020)
no_match_2019 = no_match_alpha3(country_list_2019)
no_match_2018 = no_match_alpha3(country_list_2018)
no_match_2017 = no_match_alpha3(country_list_2017)
no_match_2016 = no_match_alpha3(country_list_2016)

Now that we know which countries are not being matched, we can edit those names by hand so that each one matches exactly.

In [25]:
no_match_2016

['Antigua & Deps',
 'Bosnia Herzegovina',
 'Ireland {Republic}',
 'Ivory Coast',
 'Korea South',
 'Laos',
 'Myanmar, {Burma}',
 'Other (please specify)',
 'Sao Tome & Principe',
 'Korea North',
 'St Kitts & Nevis',
 'Trinidad & Tobago',
 'East Timor']

In [None]:
countries.search_fuzzy('Venezuela')

With that in hand, we can add a variable to each dataset to mark the year it represents and then merge them:

In [None]:
# Add year variable to dataframes
survey_2020['year'] = 2020
survey_2019['year'] = 2019
survey_2018['year'] = 2018
survey_2017['year'] = 2017
survey_2016['year'] = 2016

# Merge datasets into one
data = [survey_2020, survey_2019, survey_2018, survey_2017, survey_2016]
survey = pd.concat(data, ignore_index = True)

<a name="model"></a>
## 4. Data modeling

Now that we have cleaned and organized our data, we can proceed to answer the questions proposed in [Section 1.](#business).

> What are the most popular programming languages over the past five years?\
> What countries have more female respondents?\
> Are male respondents happier than female respondents?\
> What countries have the highest job satisfaction rates?

### What are the most popular programming languages over the past five years?

Our first question requires us to look at which languages the respondents said they knew how to use and to analyze how this has changed over the years.
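
The notebook does not include code for this question yet. One possible approach, sketched below, is to split each answer into individual languages and compute, for every year, the share of respondents who listed each language. The sketch assumes that the merged `survey` dataframe has the `languages` and `year` columns created earlier and that languages within one answer are separated by `;`, which should be checked for each survey year. The function name `language_shares` is ours.

```python
import pandas as pd

def language_shares(survey, column='languages', sep=';'):
    '''Share of respondents per year who listed each language.
    Respondents who left the question empty are excluded.'''
    answered = survey.dropna(subset=[column])
    # One row per (respondent, language)
    long = answered.assign(language=answered[column].str.split(sep)).explode('language')
    long['language'] = long['language'].str.strip()
    long = long[long['language'] != '']
    # Mentions per year and language, divided by respondents per year
    counts = long.groupby(['year', 'language']).size()
    respondents = answered.groupby('year').size()
    shares = counts.div(respondents, level='year')
    return shares.rename('share').reset_index()
```

Sorting the result by `share` within each year would then give a popularity ranking that can be compared across years.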

### What countries have more female respondents?

A big problem in tech (and many other industries) is the barrier many women face when trying to enter the industry. We can look at how the composition of respondents to the Stack Overflow Annual Developer Survey has changed to get an idea of whether more women are participating in the most important forum for programmers.

In [None]:
# Define survey with average of gender categories
df_gender = survey[['year', 'gender_male', 'gender_female', 'gender_other', 'gender_null']].groupby('year', as_index = False).mean()

# Print head
df_gender.head()

Above we can see a table with the composition of respondents by gender for the past five years. However, it might be easier to understand what is happening with a graph.

In [None]:
# Set figure size
plt.figure(figsize=(12,8))

# Define graph for each gender category
sns.lineplot(x = 'year', y = 'gender_female', data = df_gender, legend='brief', marker = 'o', label = 'Female')
sns.lineplot(x = 'year', y = 'gender_male', data = df_gender, legend='brief', marker = 'o', label = 'Male')
sns.lineplot(x = 'year', y = 'gender_other', data = df_gender, legend='brief', marker = 'o', label = 'Other')
sns.lineplot(x = 'year', y = 'gender_null', data = df_gender, legend='brief', marker = 'o', label = 'Not declared')

# Set details of plot
plt.title('Gender of respondents', fontsize = 16)
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Percentage", fontsize = 14)
plt.xticks(df_gender['year'])
plt.yticks([0,.2,.4, .6, .8, 1])
plt.gca().spines['bottom'].set_position(('data',0))
plt.legend(loc = 'center right', frameon = False)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Plot graph
plt.show();

It looks like women have consistently made up less than 10% of the respondent pool. It is worth noting, however, that many people choose not to declare their gender. We might want to look at the shares of male, female and other only among those who chose to declare their gender.

In [None]:
# Define gender adjusted dataset
df_gender_adj = df_gender[['year', 'gender_male', 'gender_female', 'gender_other']].copy()

# Set sum of relevant variables
sum_gender = df_gender_adj[['gender_male', 'gender_female', 'gender_other']].sum(axis=1)

# Adjust categories by only those who declared their gender
df_gender_adj['gender_male'] = df_gender_adj['gender_male']/sum_gender
df_gender_adj['gender_female'] = df_gender_adj['gender_female']/sum_gender
df_gender_adj['gender_other'] = df_gender_adj['gender_other']/sum_gender

# Print adjusted gender distributions
df_gender_adj.head()
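
The same renormalization can be written in one step by dividing all three columns by their row sum. This is a sketch under the assumption that `df_gender` has the columns shown above; the function name `renormalize_declared` is ours.

```python
import pandas as pd

def renormalize_declared(df_gender, columns=('gender_male', 'gender_female', 'gender_other')):
    '''Rescale the gender shares so that they sum to 1 among the
    respondents who declared their gender.'''
    cols = list(columns)
    adjusted = df_gender[['year'] + cols].copy()
    # Divide every column by the row-wise sum of the declared categories
    adjusted[cols] = adjusted[cols].div(adjusted[cols].sum(axis=1), axis=0)
    return adjusted
```

For example, shares of 0.6 male, 0.1 female and 0.1 other (with 0.2 undeclared) become 0.75, 0.125 and 0.125.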

With this in hand, we can reproduce the previous graph using the adjusted shares.

In [None]:
# Set figure size
plt.figure(figsize=(12,8))

# Define graph for each gender category
sns.lineplot(x = 'year', y = 'gender_female', data = df_gender_adj, legend='brief', marker = 'o', label = 'Female')
sns.lineplot(x = 'year', y = 'gender_male', data = df_gender_adj, legend='brief', marker = 'o', label = 'Male')
sns.lineplot(x = 'year', y = 'gender_other', data = df_gender_adj, legend='brief', marker = 'o', label = 'Other')

# Set details of plot
plt.title('Gender of respondents who declared their gender', fontsize = 16)
plt.xlabel("Year", fontsize = 14)
plt.ylabel("Percentage", fontsize = 14)
plt.xticks(df_gender['year'])
plt.yticks([0,.2,.4, .6, .8, 1])
plt.gca().spines['bottom'].set_position(('data',0))
plt.legend(loc = 'center right', frameon = False)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Plot graph
plt.show();

This is no surprise, as men already made up the majority of respondents before the adjustment. The exercise does indicate, however, that women (and people with other gender identities) are underrepresented in the Stack Overflow Annual Developer Survey, which could reflect a wider trend in the tech industry that needs to be addressed. Ideally we would want a higher participation of women in the tech industry.

### Are male respondents happier than female respondents?

Seeing that the majority of survey respondents are men, we can check whether this translates into men having higher job satisfaction than women and people with other gender identities.

In [None]:
score_jobsat = survey[['job_satisfaction', 'gender', 'year']].copy()

In [None]:
score_jobsat['year'].unique()

In [None]:
# Create new satisfaction score variable
score_jobsat['satisfaction_score'] = np.nan

# Drop rows with null scores
score_jobsat = score_jobsat.dropna(subset = ['job_satisfaction'])

# Loop over rows to assign a score to each satisfaction category
for i in tqdm(score_jobsat.index):
    if str(score_jobsat.loc[i, 'job_satisfaction']) == 'Very satisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 5
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Satisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 4
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Neither':
        score_jobsat.loc[i, 'satisfaction_score'] = 3
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Dissatisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 2
    elif str(score_jobsat.loc[i, 'job_satisfaction']) == 'Very Dissatisfied':
        score_jobsat.loc[i, 'satisfaction_score'] = 1
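
The loop above assigns one score per category, which can also be expressed as a dictionary lookup. The sketch below assumes the harmonized labels produced earlier (including the capitalized 'Very Dissatisfied'); the names `SCORE_MAP` and `add_satisfaction_score` are ours.

```python
import pandas as pd

# Score for each harmonized job satisfaction label
SCORE_MAP = {'Very satisfied': 5, 'Satisfied': 4, 'Neither': 3,
             'Dissatisfied': 2, 'Very Dissatisfied': 1}

def add_satisfaction_score(df_raw):
    '''Drop rows without a job satisfaction answer and add a
    numeric satisfaction_score column based on SCORE_MAP.'''
    df = df_raw.dropna(subset=['job_satisfaction']).copy()
    df['satisfaction_score'] = df['job_satisfaction'].map(SCORE_MAP)
    return df
```

Any label that is not a key of `SCORE_MAP` ends up with a missing score, which makes spelling mismatches between the harmonizer and the scoring step easy to spot.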


In [None]:
#df_jobsat = score_jobsat[['year', 'gender_male', 'gender_female', 'gender_other', 'gender_null']].groupby('year', as_index = False).mean()


In [None]:
# Define graph for satisfaction per gender over years
# (catplot creates its own figure, so its size is set with height and aspect)
g = sns.catplot(x = 'year', y = 'satisfaction_score', hue = 'gender', kind= 'bar', data = score_jobsat, height = 8, aspect = 1.5)
g._legend.set_title("Gender")
# Set details of plot
plt.title('Satisfaction of respondants by gender (2016 - 2020)', fontsize = 16)
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Satisfaction score', fontsize = 14);

It doesn't look like there is a significant difference in job satisfaction between genders over the years. Note, however, that satisfaction levels do seem to have been slightly higher in 2016 than in the other years.

### What countries have the highest job satisfaction rates?
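
This question is not answered in the notebook yet. As a rough sketch of one way to approach it, we could average the satisfaction score by country and keep only countries with enough respondents for the average to be meaningful. The sketch assumes a dataframe that contains both the `iso3` column and the `satisfaction_score` column created earlier. The function name `satisfaction_by_country` and the `min_respondents` threshold are our own assumptions.

```python
import pandas as pd

def satisfaction_by_country(df, min_respondents=100):
    '''Mean satisfaction score and number of respondents per country,
    sorted from highest to lowest mean. Countries with fewer than
    min_respondents scored answers are left out.'''
    scored = df.dropna(subset=['iso3', 'satisfaction_score'])
    summary = scored.groupby('iso3')['satisfaction_score'].agg(['mean', 'count'])
    summary = summary[summary['count'] >= min_respondents]
    return summary.sort_values('mean', ascending=False)
```

The threshold matters because a country with only a handful of respondents can easily top the ranking by chance.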

<a name="eval"></a>
## 5. Evaluate the results

Text text

<a name="deploy"></a>
## 6. Deploy

Text text

In [None]:
os.getcwd()

In [None]:
import geopandas as gpd

In [None]:
map_df = gpd.read_file('IPUMSI_world_release2017/world_countries_2017.shp')

In [None]:
ax = map_df.plot()
ax.axis('off');

In [None]:
map_df.head()

In [None]:
countries = map_df['CNTRY_NAME'].unique().tolist()

In [None]:
'Bahamas' in countries

In [None]:
countries2 = survey_2020['Country'].unique().tolist()

In [None]:
import country_converter as coco

iso_lst1 = []
iso_lst2 = []

for country in countries:
    iso1 = coco.convert(names=country, to='ISO3')
    iso_lst1.append(iso1)

for country2 in countries2:
    iso2 = coco.convert(names=country2, to='ISO3')
    iso_lst2.append(iso2)


In [None]:
for i in iso_lst2:
    print(i, i in iso_lst1)

In [None]:
'US' in iso_lst1

In [None]:
coco.convert(names='United States of America', to='ISO3')

In [None]:
'USA' in iso_lst2

In [None]:
for i in survey_2020.index:
    survey_2020.loc[i, 'Country'] = coco.convert(names = str(survey_2020.loc[i, 'Country']), to = 'ISO3')

In [None]:
survey_2020.head()