# Data exploration on Stack Overflow Annual Developer Survey for 2022 
## Part 1

Datasets: https://insights.stackoverflow.com/survey

Goal: Get familiar with and explore the dataset for 2022

## Get familiar with data structure


1. survey_results_public.csv - CSV file with main survey results, one respondent per row and one column per answer
2. survey_results_schema.csv - CSV file with survey schema, i.e., the questions that correspond to each column name

TIPS: 
- load both CSV files as DataFrames and explore the data (describe, info)
- use the Series str accessor (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html) to apply Python's string functions to each value
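
The tips above can be tried on a tiny example first. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the survey files.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey files
toy = pd.DataFrame({
    'Country': ['Germany', 'India', None],
    'YearsCode': [14.0, 3.0, None],
})

toy.info()             # column dtypes and non-null counts
print(toy.describe())  # summary statistics for the numeric columns

# The .str accessor applies a string method to each value;
# missing values stay missing instead of raising an error
upper = toy['Country'].str.upper()
print(upper)
```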

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../datasets/stack-overflow-developer-survey/survey_results_public.csv')
df.head(3)

Unnamed: 0,ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,...,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,None of these,,,,,,,,,...,,,,,,,,,,
1,2,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects,,,,,,...,,,,,,,,Too long,Difficult,
2,3,"I am not primarily a developer, but I write co...","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Programming Game...,,14.0,...,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0


In [3]:
# check the shape (number of rows, number of columns)
df.shape

(73268, 79)

In [4]:
# let's take a look at the column names - each column represents a question in the survey
df.columns

Index(['ResponseId', 'MainBranch', 'Employment', 'RemoteWork',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'LearnCodeCoursesCert', 'YearsCode', 'YearsCodePro', 'DevType',
       'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'Country', 'Currency',
       'CompTotal', 'CompFreq', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'NEWCollabToolsHaveWorkedWith',
       'NEWCollabToolsWantToWorkWith', 'OpSysProfessional use',
       'OpSysPersonal use', 'VersionControlSystem', 'VCInteraction',
       'VCHostingPersonal use', 'VCHostingProfessional use',
       'OfficeStackAsyncHaveWorkedWith', 'OfficeStackAsyncWantToWorkWith',
       'OfficeStackSyncHaveWork

In [5]:
# we'll need the schema file in order to map column names to questions in the survey
df_schema = pd.read_csv('../datasets/stack-overflow-developer-survey/survey_results_schema.csv')
df_schema.head()

Unnamed: 0,qid,qname,question,force_resp,type,selector
0,QID16,S0,"<div><span style=""font-size:19px;""><strong>Hel...",False,DB,TB
1,QID12,MetaInfo,Browser Meta Info,False,Meta,Browser
2,QID1,S1,"<span style=""font-size:22px; font-family: aria...",False,DB,TB
3,QID2,MainBranch,Which of the following options best describes ...,True,MC,SAVR
4,QID296,Employment,Which of the following best describes your cur...,False,MC,MAVR


## Which is the most popular technology (programming language)?

**To answer this question, we must discover which column of the survey dataset (df) contains the answers to it.**

So, first, we must find the 'question' in df_schema that contains the word "technology" or "language", and get the corresponding 'qname' value.


Let's take a look at the questions in the schema file and the mapping between each question and the df column names.

In [6]:
# from df_schema get rows 10: 30, columns qname and question 
df_schema.loc[10:30, ['qname', 'question'] ]

Unnamed: 0,qname,question
10,LearnCodeOnline,What online resources do you use to learn to c...
11,LearnCodeCoursesCert,What online courses or certifications do you u...
12,YearsCode,"Including any education, how many years have y..."
13,YearsCodePro,"NOT including education, how many years have y..."
14,DevType,Which of the following describes your current ...
15,OrgSize,Approximately how many people are employed by ...
16,PurchaseInfluence,"What level of influence do you, personally, ha..."
17,BuyNewTool,"When buying a new tool or software, how do you..."
18,Country,"Where do you live? <span style=""font-weight: b..."
19,Currency,Which currency do you use day-to-day? If your ...


Search for 'technology' or 'language' in the questions.
We will use the [Series.str.contains() method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)

In [7]:
mask1 = df_schema.question.str.contains('language', case=False) 
mask2 = df_schema.question.str.contains('technology', case=False) 

# element-wise OR in Pandas is marked with '|', not 'or' 
tech_questions = df_schema.loc[mask1 | mask2, ['qname', 'question']]
tech_questions

Unnamed: 0,qname,question
0,S0,"<div><span style=""font-size:19px;""><strong>Hel..."
16,PurchaseInfluence,"What level of influence do you, personally, ha..."
22,S3,"<span style=""font-size:22px; font-family: aria..."
23,Language,"Which <b>programming, scripting, and markup la..."
28,ToolsTech,Which <b>developer </b><strong>tools</strong> ...
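
The same search can also be written as a single call. This is a minimal sketch, assuming a made-up `schema` frame in place of df_schema: the pattern `'language|technology'` is a regular expression in which `|` means "either word", and `na=False` treats missing questions as non-matches so the mask is purely boolean.

```python
import pandas as pd

# Hypothetical stand-in for df_schema (values invented for illustration)
schema = pd.DataFrame({
    'qname': ['Language', 'Country', 'ToolsTech', 'Meta'],
    'question': ['Which programming languages have you used?',
                 'Where do you live?',
                 'Which developer tools and technology do you use?',
                 None],
})

# One regex with alternation instead of two masks combined with '|'
mask = schema['question'].str.contains('language|technology', case=False, na=False)
matches = schema.loc[mask, 'qname'].tolist()
print(matches)
```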


Let's print only the questions, using the row indexes above. We can get the indexes of a DF using the [DataFrame.index property](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html)

In [8]:
indexes = tech_questions.index

for idx in indexes:
	print(f'\n\tQuestion Name:{df_schema.loc[idx, "qname"]}:')
	print(f'{df_schema.loc[idx, "question"]}')


	Question Name:S0:
<div><span style="font-size:19px;"><strong>Hello world! </strong></span></div>

<div> </div>

<div>Thank you for taking the 2022 Stack Overflow Developer Survey, the longest running survey of software developers (and anyone else who codes!) on Earth. </div>

<div> </div>

<div>As in previous years, anonymized results of the survey will be made publicly available under the Open Database License, where anyone can download and analyze the data. On that note, throughout the survey, certain answers you and your peers give will be treated as personally identifiable information, and therefore kept out of the anonymized results file. We'll call out each of those in the survey with a note saying "This information will be kept private." </div>

<div> </div>

<div>There are seven sections in this survey. The 2nd, 3rd, and 4th sections will appear in a random order.</div><div><br></div>

<div>   1. Basic Information</div>

<div>   2. Education, Work, and Career</div>

<div>   3
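
The question texts contain HTML markup, which makes them hard to read. As a rough sketch (not part of the original analysis), the tags can be removed with Series.str.replace and a simple regular expression. The `questions` Series below is invented for illustration, and the pattern is only a quick heuristic, not a full HTML parser.

```python
import pandas as pd

# Hypothetical question texts with HTML markup
questions = pd.Series([
    'Which <b>programming</b> languages?',
    '<div>Hello world! </div>',
])

# Remove anything that looks like a tag, then trim surrounding whitespace
clean = questions.str.replace(r'<[^>]+>', '', regex=True).str.strip()
print(clean.tolist())
```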

### Explore data using [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)

From the analysis above, we see that we are interested in the column named 'Language'. But if we try to find that column in df, we see that there is no such column: the qname values in df_schema do not always match the df column names exactly. As a workaround, let's find the df column names that contain the string 'Language'.

In [9]:
mask = df.columns.str.contains('language', case=False)
df.columns[mask]

Index(['LanguageHaveWorkedWith', 'LanguageWantToWorkWith'], dtype='object')
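
Since both matching columns begin with the qname, a prefix match is a slightly stricter alternative to a substring match. The sketch below uses a made-up `columns` Index that mimics the survey's naming pattern; whether every qname in the schema is a prefix of its df columns is an assumption, not something checked here.

```python
import pandas as pd

# Hypothetical column names following the survey's naming pattern
columns = pd.Index([
    'Country',
    'LanguageHaveWorkedWith',
    'LanguageWantToWorkWith',
    'DatabaseHaveWorkedWith',
])

qname = 'Language'
# Keep only the columns whose name starts with the qname
matching = columns[columns.str.startswith(qname)]
print(list(matching))
```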

Now that we know the column name ('LanguageWantToWorkWith'), we want to count the unique values in it, in order to find which language developers most want to work with.

In [10]:
df.LanguageWantToWorkWith.value_counts(dropna=False)

NaN                                                             6241
Python                                                          1021
HTML/CSS;JavaScript;TypeScript                                   945
Rust                                                             825
C#                                                               568
                                                                ... 
HTML/CSS;Java;JavaScript;Kotlin;MATLAB;PowerShell;Python           1
C;C#;Dart;F#;HTML/CSS;JavaScript;Python;Rust;SQL;TypeScript        1
C;Clojure;Elixir;Go;JavaScript;Rust;TypeScript                     1
HTML/CSS;Java;Kotlin;Python;Scala;SQL                              1
Bash/Shell;C#;HTML/CSS;JavaScript;Perl;PowerShell;TypeScript       1
Name: LanguageWantToWorkWith, Length: 23953, dtype: int64

Problem: we see that this column can contain multiple values separated by ';', so we'll have to write more code. We need to count each mentioned language separately.

In [11]:
# we will split the values separated by ';' into separate columns:
expanded = df.LanguageWantToWorkWith.str.split(';', expand=True)
expanded

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,,,,,,,,,,,...,,,,,,,,,,
1,Rust,TypeScript,,,,,,,,,...,,,,,,,,,,
2,C#,C++,HTML/CSS,JavaScript,TypeScript,,,,,,...,,,,,,,,,,
3,C#,SQL,TypeScript,,,,,,,,...,,,,,,,,,,
4,C#,Elixir,F#,Go,JavaScript,Rust,TypeScript,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73263,Bash/Shell,Go,JavaScript,Python,SQL,TypeScript,,,,,...,,,,,,,,,,
73264,HTML/CSS,JavaScript,Python,,,,,,,,...,,,,,,,,,,
73265,C#,HTML/CSS,JavaScript,PHP,Python,SQL,,,,,...,,,,,,,,,,
73266,Delphi,,,,,,,,,,...,,,,,,,,,,


In [12]:
# in order to find unique value counts for the whole data frame, we will convert it to a single Series object with stack(), and then use value_counts()
expanded.stack().value_counts().head(5)

JavaScript    31551
Python        29350
TypeScript    26050
HTML/CSS      25423
SQL           24804
dtype: int64
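
An alternative to expand=True plus stack() is to keep the split values as lists and turn each list element into its own row with Series.explode(). The sketch below uses a small made-up `answers` Series rather than the survey column, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical answers in the same ';'-separated format as the survey
answers = pd.Series(['Rust;TypeScript', 'Python', None, 'Python;Rust'])

# split() yields a list per row, explode() puts every list element on its own row,
# and value_counts() ignores the missing answer by default
counts = answers.str.split(';').explode().value_counts()
print(counts)
```

This avoids building a wide intermediate DataFrame that is mostly empty.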

# HW: Find the top 5 countries from which people have answered the survey

In [13]:
### your code here


United States of America                                13543
India                                                    6639
Germany                                                  5395
United Kingdom of Great Britain and Northern Ireland     4190
Canada                                                   2490
Name: Country, dtype: int64