# 1.0 Introduction
##### prepared by [Edward Goh](https://www.kaggle.com/edwars)

The recurring theme of this analysis is the **salary**, or compensation, of the participants in relation to various aspects such as gender, education level, languages used and tools used. Some of the data exploration is not focused on salary but on categories that I find interesting myself, such as age vs job title and the distribution of job titles by gender.

## 1.1 Objectives:
The objective is to explore whether any relationship exists between compensation (AKA salary) and the other data collected in the dataset. The visualizations of interest are listed below, and each has a corresponding header later in the notebook.


1. Influence of Gender on Salary
2. Job Distribution with respect to Gender
3. Age Group distribution in Various Job Positions
4. Average Salary changes with respect to Age Group
5. Average Salary with respect to education level
6. Average Salary trend with respect to Coding Experience
7. No. of language vs Wage
8. No. of visualization vs Wage
9. No. of ML_framework vs Wage
10. No. of Learning resources vs Wage


### 1.1.1 Glossary

For the sake of easier discussion, the term "salary" will be used in place of "wage" and "compensation". In this report, the term "Data Workers", for lack of a better term, is used as a broad term covering all of the different job scopes, such as data scientists, data analysts, data engineers, etc.

## 1.2 Data Cleaning

We first need to clean the data and transform it into a form that is easier to work with. We will rename all the columns to more easily understandable and accessible column headers. Other than that, we will replace some values in the dataframe.

In [175]:
# Importing libraries

import pandas as pd
import cufflinks as cf
# import matplotlib.pyplot as plt
# import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, download_plotlyjs, plot, iplot
init_notebook_mode(connected=True)
import numpy as np

%matplotlib inline

In [176]:
#Reading the dataset
responses_2021 = './data/kaggle_survey_2021_responses.csv'
df = pd.read_csv(responses_2021, low_memory=False)

In [177]:
#Dropping the questions row
df = df.drop(0)

#Selecting columns that are relevant to my analysis
cols_to_keep = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6',
                'Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER',
                'Q14_Part_1', 'Q14_Part_2', 'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6', 'Q14_Part_7', 'Q14_Part_8', 'Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11', 'Q14_OTHER',
                'Q16_Part_1', 'Q16_Part_2', 'Q16_Part_3', 'Q16_Part_4', 'Q16_Part_5', 'Q16_Part_6', 'Q16_Part_7', 'Q16_Part_8', 'Q16_Part_9', 'Q16_Part_10', 'Q16_Part_11', 'Q16_Part_12', 'Q16_Part_13', 'Q16_Part_14', 'Q16_Part_15', 'Q16_Part_16', 'Q16_Part_17', 'Q16_OTHER',
                'Q20', 'Q25',
                'Q40_Part_1', 'Q40_Part_2', 'Q40_Part_3', 'Q40_Part_4', 'Q40_Part_5', 'Q40_Part_6', 'Q40_Part_7', 'Q40_Part_8', 'Q40_Part_9', 'Q40_Part_10', 'Q40_Part_11', 'Q40_OTHER']

df = df[cols_to_keep]

#Renaming all remaining columns into something easy to understand
df = df.rename(columns=
          {   "Q1": "age", 
              "Q2": "gender",
              'Q3': "country",
              'Q4': 'education_level',
              'Q5': 'recent_job_title',
              'Q6': 'coding_exp_year',
              'Q7_Part_1':'language_python',
              'Q7_Part_2':'language_r',
              'Q7_Part_3':'language_sql',
              'Q7_Part_4':'language_c',
              'Q7_Part_5':'language_cpp',
              'Q7_Part_6':'language_java',
              'Q7_Part_7':'language_js',
              'Q7_Part_8':'language_julia',
              'Q7_Part_9':'language_swift',
              'Q7_Part_10':'language_bash',
              'Q7_Part_11':'language_matlab',
              'Q7_Part_12':'language_none',
              'Q7_OTHER':'language_other',
              'Q14_Part_1':'dataviz_mpl',
              'Q14_Part_2':'dataviz_sns',
              'Q14_Part_3':'dataviz_plotly',
              'Q14_Part_4':'dataviz_ggplot',
              'Q14_Part_5':'dataviz_shiny',
              'Q14_Part_6':'dataviz_d3js',
              'Q14_Part_7':'dataviz_altair',
              'Q14_Part_8':'dataviz_bokeh',
              'Q14_Part_9':'dataviz_geoplotlib',
              'Q14_Part_10':'dataviz_leaflet_folium',
              'Q14_Part_11':'dataviz_none',
              'Q14_OTHER':'dataviz_other',
              'Q16_Part_1':'ml_framework_sklearn',
              'Q16_Part_2':'ml_framework_tensorflow',
              'Q16_Part_3':'ml_framework_keras',
              'Q16_Part_4':'ml_framework_pytorch',
              'Q16_Part_5':'ml_framework_fastai',
              'Q16_Part_6':'ml_framework_mxnet',
              'Q16_Part_7':'ml_framework_xgboost',
              'Q16_Part_8':'ml_framework_lightgbm',
              'Q16_Part_9':'ml_framework_catboost',
              'Q16_Part_10':'ml_framework_prophet',
              'Q16_Part_11':'ml_framework_h2o3',
              'Q16_Part_12':'ml_framework_caret',
              'Q16_Part_13':'ml_framework_tidymodels',
              'Q16_Part_14':'ml_framework_jax',
              'Q16_Part_15':'ml_framework_pytorch_lightning',
              'Q16_Part_16':'ml_framework_hugging_face',
              'Q16_Part_17':'ml_framework_none',
              'Q16_OTHER':'ml_framework_other',
              'Q20':'industry',
              'Q25':'compensation',
              'Q40_Part_1':'learning_coursera',
              'Q40_Part_2':'learning_edx',
              'Q40_Part_3':'learning_kagglelearn',
              'Q40_Part_4':'learning_datacamp',
              'Q40_Part_5':'learning_fastai',
              'Q40_Part_6':'learning_udacity',
              'Q40_Part_7':'learning_udemy',
              'Q40_Part_8':'learning_linkedinlearning',
              'Q40_Part_9':'learning_cloud_certification_program',
              'Q40_Part_10':'learning_university',
              'Q40_Part_11':'learning_none',
              'Q40_OTHER':'learning_other'
          })


We will handle the multiple-selection columns by changing all the NaN values into 0 and all other (selected) values into 1. This makes downstream transformations easier, as well as some calculated fields that I have in mind.

We will also replace each salary range with the midpoint of the range. This turns the string value into a numerical value that is easier to analyze. Although this introduces some inaccuracy into the dataset, it is much easier to work with and enables numerical analysis downstream. Missing salaries are set to 0, which is why most of the later sections filter on `compensation > 0`.


In [178]:
#Handling the columns -  differentiating between multiple selection and single answer questions
single_answer_cols = ['age', 'gender', 'country', 'education_level', 'recent_job_title',
       'coding_exp_year','industry', 'compensation']

df_single_choice = df[single_answer_cols]

df_multiple_choice = df.drop(columns=single_answer_cols)

df_multiple_choice = df_multiple_choice.notnull().astype('int')
# Both frames share the same index, so they can be placed side by side directly
df = pd.concat([df_single_choice, df_multiple_choice], axis=1)
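
As a small illustration of the encoding step above, the sketch below applies the same `notna().astype(int)` idea to a made-up two-column frame. The `toy` frame and its values are hypothetical and only stand in for the survey's multiple-selection columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two multiple-selection columns:
# the survey stores the option text when selected and NaN otherwise.
toy = pd.DataFrame({
    'language_python': ['Python', np.nan, 'Python'],
    'language_r': [np.nan, np.nan, 'R'],
})

# Selected -> 1, not selected (NaN) -> 0
encoded = toy.notna().astype(int)
print(encoded)
```

Once every option is a 0/1 flag, summing across a group of columns gives the number of options a participant selected.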

In [179]:
#Handling the compensation, replacing into actual values instead of leaving it as a string

compensation_lookup = {'25,000-29,999': (25000+30000)/2,
                       '60,000-69,999': (60000+70000)/2,
                       '$0-999': (0+1000)/2,
                       '30,000-39,999': (30000+40000)/2, 
                        np.nan: 0,
                       '15,000-19,999': (15000+20000)/2, 
                       '70,000-79,999': (70000+80000)/2, 
                       '2,000-2,999': (2000+3000)/2, 
                       '10,000-14,999': (10000+15000)/2,
                       '5,000-7,499': (5000+7500)/2, 
                       '20,000-24,999': (20000+25000)/2, 
                       '1,000-1,999': (1000+2000)/2, 
                       '100,000-124,999': (100000+125000)/2,
                       '7,500-9,999': (7500+10000)/2, 
                       '4,000-4,999': (4000+5000)/2, 
                       '40,000-49,999': (40000+50000)/2, 
                       '50,000-59,999': (50000+60000)/2,
                       '3,000-3,999': (3000+4000)/2, 
                       '300,000-499,999': (300000+500000)/2, 
                       '200,000-249,999': (200000+250000)/2,
                       '125,000-149,999': (125000+150000)/2,
                       '250,000-299,999': (250000+300000)/2, 
                       '80,000-89,999': (80000+90000)/2,
                       '90,000-99,999': (90000+100000)/2, 
                       '150,000-199,999':(150000+200000)/2, 
                       '>$1,000,000': 1000000,
                       '$500,000-999,999': (500000+1000000)/2}

df.compensation = df.compensation.replace(compensation_lookup)
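
A hand-typed lookup like the one above is easy to mistype, so here is a minimal sketch of deriving each midpoint from the range label itself. The helper name `range_midpoint` is hypothetical and not part of the original notebook. It follows the lookup's convention of treating a bucket such as `7,500-9,999` as running from 7,500 to 10,000, and it keeps the bound itself for the open-ended top bucket.

```python
import re

def range_midpoint(label):
    """Midpoint of a salary-range label such as '7,500-9,999'."""
    # Pull out every number, dropping '$', '>' and thousands separators
    numbers = [int(n.replace(',', '')) for n in re.findall(r'\d[\d,]*', label)]
    if len(numbers) == 2:
        low, high = numbers
        # '+ 1' turns the inclusive upper bound (9,999) into 10,000
        return (low + high + 1) / 2
    # Open-ended bucket such as '>$1,000,000': keep the bound itself
    return float(numbers[0])

for label in ['$0-999', '7,500-9,999', '$500,000-999,999', '>$1,000,000']:
    print(label, '->', range_midpoint(label))
```

Missing answers (NaN) are not strings, so they would still need to be handled separately before mapping this helper over the column.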

We are also rephrasing some very long answers into shorter, more manageable phrases that are easier to fit into visualizations. The replacement phrases were chosen to be as close in meaning to the original answers as possible.

In [180]:
# Rephrasing some of the options to show it better on visualization

education_level_lookup = {'I prefer not to answer':'Not Disclosed',
                         'Some college/university study without earning a bachelor’s degree':'Partial Tertiary Studies',
                         'No formal education past high school':'High School'}

df.education_level = df.education_level.replace(education_level_lookup)

Finally, we are adding some calculated fields, as I am interested in knowing the total no. of languages, data visualization libraries etc. that one participant knows. These calculated fields will be used in downstream analysis.

The sum of the calculated fields excludes the 'none' columns, such as language_none that indicates the participant knows/uses no programming language.

In [181]:
language_cols = ['language_python',
 'language_r',
 'language_sql',
 'language_c',
 'language_cpp',
 'language_java',
 'language_js',
 'language_julia',
 'language_swift',
 'language_bash',
 'language_matlab',
'language_other']

df['language_sum'] = df[language_cols].sum(axis=1)

dataviz_cols = ['dataviz_mpl', 'dataviz_sns', 'dataviz_plotly', 'dataviz_ggplot',
       'dataviz_shiny', 'dataviz_d3js', 'dataviz_altair', 'dataviz_bokeh',
       'dataviz_geoplotlib', 'dataviz_leaflet_folium','dataviz_other']

df['dataviz_sum']  = df[dataviz_cols].sum(axis=1)

ml_framework_cols = ['ml_framework_sklearn', 'ml_framework_tensorflow',
       'ml_framework_keras', 'ml_framework_pytorch', 'ml_framework_fastai',
       'ml_framework_mxnet', 'ml_framework_xgboost', 'ml_framework_lightgbm',
       'ml_framework_catboost', 'ml_framework_prophet', 'ml_framework_h2o3',
       'ml_framework_caret', 'ml_framework_tidymodels', 'ml_framework_jax',
       'ml_framework_pytorch_lightning', 'ml_framework_hugging_face',
       'ml_framework_other']

df['ml_framework_sum']  = df[ml_framework_cols].sum(axis=1)

learning_cols = ['learning_coursera','learning_edx','learning_datacamp',
       'learning_fastai', 'learning_udacity', 'learning_udemy',
       'learning_linkedinlearning', 'learning_cloud_certification_program',
       'learning_university', 'learning_kagglelearn', 'learning_other']

df['learning_sum']  = df[learning_cols].sum(axis=1)
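
The column lists above are typed out by hand, which makes it easy to include a `_none` column by accident. As a hedged alternative, the sketch below builds a list from the column prefix instead. The helper `tool_columns` and the `example_columns` list are hypothetical. The helper also skips `_sum` fields so it can be called after the calculated fields exist.

```python
def tool_columns(columns, prefix):
    """Columns under `prefix`, skipping the '_none' option and any '_sum' field."""
    return [c for c in columns
            if c.startswith(prefix) and not c.endswith(('_none', '_sum'))]

# Hypothetical subset of the renamed column headers
example_columns = ['age', 'language_python', 'language_r', 'language_none',
                   'language_other', 'language_sum', 'dataviz_mpl']

print(tool_columns(example_columns, 'language_'))
```

With the real dataframe, the call would look like `tool_columns(df.columns, 'ml_framework_')`.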

# 2.0 Visualization and Discussions
With the data cleaned and transformed, the dataset is now ready to be used to create visualizations that address the objectives we set out. Let's begin to create the visualizations.



## 2.1 Influence of Gender on Salary

From the [executive summary](https://www.kaggle.com/kaggle-survey-2021) prepared by Kaggle, we know that there are still far more male data workers than female data workers. Let's take a look at the gender distribution and then explore the possibility of a wage gap between different genders.

In [182]:
temp = df[(df.compensation > 0) & (df.gender.notna())] 

temp = temp.groupby('gender').count()
temp.rename(columns={'age':'count'}, inplace=True)


# fig = px.bar(temp, x=temp.index, y='count', title='Gender Distribution in participants', text='count')
# fig.update_layout(xaxis_title='Gender', yaxis_title='No. of People')

# A pie chart has no x/y axes, so no axis titles are set here
fig = px.pie(temp, names=temp.index, values='count', title='Gender Distribution in participants')
fig

In [183]:
temp = df[(df.compensation > 0) & (df.gender.notna())] 

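# Average (not total) salary per gender, so that group size does not dominate the chart
temp = temp.groupby('gender')[['compensation']].mean()

fig = px.bar(temp, x=temp.index, y='compensation', title='Average Salary vs Gender',text='compensation')
fig.update_layout(xaxis_title='Gender', yaxis_title='Average Salary')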
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')

fig

From the bar chart, we can see that no particular gender is at a huge disadvantage. Interestingly, Non-Binary participants actually have a higher salary than the other groups of data workers. However, this is not very representative of the entire population, as there are only 47 Non-Binary participants compared to 12,642 male participants.

Our objective was to investigate whether there was any discrimination against genders other than male, and from our data it is not obvious when we examine the data as a whole (using mean, sum, etc.).
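
Because group sizes differ so much, it can help to read the average next to the group size and the median. The sketch below shows one way to do that with `groupby(...).agg(...)`. The `toy` frame, its gender labels and its salaries are made up for illustration and are not survey results.

```python
import pandas as pd

# Hypothetical data: one group has a single, high-earning respondent
toy = pd.DataFrame({
    'gender': ['Man', 'Man', 'Man', 'Man', 'Woman', 'Woman', 'Nonbinary'],
    'compensation': [30000, 45000, 55000, 70000, 35000, 65000, 90000],
})

# Mean, median and group size side by side
summary = toy.groupby('gender')['compensation'].agg(['mean', 'median', 'count'])
print(summary)
```

A high mean backed by a very small `count` is a hint that the figure may not generalize.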


## 2.2 Job Distribution with respect to Gender

This is one of the sections that are not related to salary, but it is an interesting visualization that is worth looking into as well.

The objective is to look at the gender distribution within each job position itself, and observe if there are any interesting insights.

In [184]:
temp = df[(df.recent_job_title.notna())& (df.gender.notna())] 

# temp = temp.groupby

fig = px.histogram(temp, x='recent_job_title', color='gender', barmode='group', title='Job Position Distribution Vs Gender')
fig.update_layout(xaxis_title='Most Recent Job Title', yaxis_title='No. of Person')

From the visualization, we know there are a lot of students participating in this survey, followed by data scientists and software engineers. Let's calculate the male to female ratio in these top 3 job positions.

In [185]:
print(f"The male to female ratio for Students, Data Scientists and Software Engineers, are {round(5093/1574,2)}, {round(2971/584,2)} and {round(2045/366,2)} respectively")

The male to female ratio for Students, Data Scientists and Software Engineers, are 3.24, 5.09 and 5.59 respectively


From this calculation, we can see that the male to female ratio among students is actually lower than among Data Scientists and Software Engineers. We can probably expect this lower ratio to slowly trickle down to other job positions, assuming these students pursue a data science related field after they graduate.
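
The ratios above were computed from counts read off the chart. As a sketch of how they could be derived from the data instead, `pd.crosstab` gives the count of each gender per job title. The `toy` frame is made up, and the labels `'Man'` and `'Woman'` are an assumption about how the survey spells these answers.

```python
import pandas as pd

# Hypothetical respondents: 5 students and 6 data scientists
toy = pd.DataFrame({
    'recent_job_title': ['Student'] * 5 + ['Data Scientist'] * 6,
    'gender': ['Man', 'Man', 'Man', 'Woman', 'Woman',
               'Man', 'Man', 'Man', 'Man', 'Man', 'Woman'],
})

# Rows: job title, columns: gender, values: number of respondents
counts = pd.crosstab(toy['recent_job_title'], toy['gender'])
ratio = (counts['Man'] / counts['Woman']).round(2)
print(ratio)
```

This keeps the ratio in sync with the dataframe if the filtering changes.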

## 2.3 Age Group distribution in Various Job Positions

This is also a section that is not directly related to wage, but also an interesting part of data to explore. Let's take a look at the visualization.

In [186]:
fig = px.histogram(df, color='age', x='recent_job_title', title='Age vs Most Recent Job Position')
fig.update_layout(xaxis_title='Most Recent Job Title', yaxis_title='No. of Person')

From the visualization, we can observe that there are, naturally, a lot of students in the 18-21 age group, but there is also an impressive number of students aged 40-44. Shoutout and kudos to them, as this is a great example of life-long learning.

Another observation is that there is a good number of people aged 25-29 in the Data Analyst, Data Scientist, Software Engineer and ML Engineer groups.

Generally, we can see that the survey participants are dominated by the 22-44 age groups. The counts for age groups above 45 years old are much lower. This makes sense, as data science started becoming popular around 2010 ([according to this wiki](https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/A_History_of_Data_Science)), which explains why we are not seeing more mature age groups dominating the field.

## 2.4 Average Salary changes with respect to Age Group

After we have explored and understood the data a bit more in terms of Age group distribution, Gender Distribution and Job distribution in age and gender, let's take our focus back to salary. We shall now create visualizations to explore the relationship between **Age Group and Salary**.

In [187]:
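# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df[df.compensation > 0].groupby('age').mean(numeric_only=True)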

fig = px.histogram(temp, x=temp.index, y='compensation', title='Average Salary vs Age group')
fig.update_layout(xaxis_title='Age group', yaxis_title='Average Salary')

From this visualization, it can be seen that there is a gradual increase in the average salary from 18-21 years old all the way up to 45-49 years old. After that, the average salary fluctuates and does not have a definitive upward trend. One of the most significant jumps is from the 25-29 age group to the 30-34 age group, an increase of around 16K in average salary.

## 2.5 Average Salary with respect to education level

In this section, we shall explore how salary relates to the participants' education level.

In [188]:
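# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df[(df.compensation > 0) & (df.education_level.notna())].groupby('education_level').mean(numeric_only=True)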

fig = px.bar(temp, x=temp.index, y='compensation', title='Salary vs Education Level', 
             category_orders={'education_level':['Not Disclosed','High School', 'Partial Tertiary Studies','Bachelor’s degree','Master’s degree', 'Doctoral degree','Professional doctorate']})
fig.update_layout(xaxis_title='Education Level', yaxis_title='Average Salary')

From the visualization, we can infer that there is a salary gap between the Bachelor's degree and the Master's degree. There is another gap between the Master's degree and the Doctoral degree and Professional doctorate.

However, I think one important insight from this visualization is that there is no significant difference in salary between high school, partial tertiary studies and a formal bachelor's degree. There are a few ways to discuss this finding.

Firstly, it seems reasonable to say that with sufficient experience and self-learning, a self-taught data analyst should be able to narrow the gap with a data analyst who has formal education. Other than that, we do not know whether the participants who chose Bachelor's degree necessarily hold a degree in data analytics or data science itself. It might be in another field, and they may also be self-learners.

With this, let us end this section and explore, in the next section, our assumption that experience is a more valuable asset than education level.

## 2.6 Coding Experience effects on Salary

Continuing from the last section, we will now look at visualizations that show us the salary trend with respect to coding experience. 

In [189]:
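# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df[(df.compensation > 0) & (df.coding_exp_year.notna())].groupby('coding_exp_year').mean(numeric_only=True)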

fig = px.histogram(temp, x=temp.index, y='compensation', 
                   title='Average Salary vs Coding Experience',
                   category_orders={'coding_exp_year':['I have never written code','< 1 years','1-3 years','3-5 years','5-10 years','10-20 years','20+ years']})
fig.update_layout(xaxis_title='Experience of Coding(Years)', yaxis_title='Average Salary')

We can see a clear pattern in this visualization: salary increases substantially with coding experience. This is very encouraging to newcomers, as it suggests that data workers' experience is valued by the industry and employers, and that this is reflected at least in monetary terms.
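
The experience buckets are text labels, so they sort alphabetically unless an order is given. Besides `category_orders` in the plot call, one option is to store the order on the column itself with an ordered `pd.Categorical`, so that any grouped table comes out in the right order. The sketch below uses the bucket labels from the plot call and a made-up `toy` frame.

```python
import pandas as pd

experience_order = ['I have never written code', '< 1 years', '1-3 years',
                    '3-5 years', '5-10 years', '10-20 years', '20+ years']

# Hypothetical respondents, deliberately out of order
toy = pd.DataFrame({
    'coding_exp_year': ['5-10 years', '< 1 years', '1-3 years', '< 1 years'],
    'compensation': [60000, 10000, 25000, 20000],
})
toy['coding_exp_year'] = pd.Categorical(
    toy['coding_exp_year'], categories=experience_order, ordered=True)

# observed=True keeps only the buckets that actually occur in the data
average = toy.groupby('coding_exp_year', observed=True)['compensation'].mean()
print(average)
```

The grouped result follows the declared bucket order rather than alphabetical order.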

## 2.7 Average Salary with respect to language

In this section, we shall explore how salary changes with respect to programming languages. There are 2 ways that I would like to explore this. First, we will look at the average salary for each language, and then I would like to explore whether knowing more languages is associated with a higher salary.

*This format will be applied when we explore data visualization libraries, machine learning frameworks and learning platforms used.* 

Let's start to visualize our data.

In [190]:
language_average_salary = {}
language_average_salary['language'] = []
language_average_salary['salary'] = []
language_average_salary['count'] = []


for language in language_cols:
    temp = df[df[language] == 1]
    language_average_salary['language'].append(language)
    language_average_salary['salary'].append(temp.compensation.mean())
    language_average_salary['count'].append(temp.compensation.count())


temp = pd.DataFrame(data=language_average_salary)

px.bar(temp, x='language', y=['salary','count'],barmode='group',title='Average Salary of Each Programming Language')

In [191]:
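# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df.groupby(by='language_sum').mean(numeric_only=True)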

fig = px.bar(temp, x=temp.index, y='compensation',title='No. of Languages used vs Average Salary')
fig.update_layout(xaxis_title='No. Of Languages Used', yaxis_title='Average Salary')
fig

The count of people using each language was added to the first visualization to give a sense of how the average salary relates to the number of users of each language. The three highest average salaries belong to Swift, Julia and Bash, with user counts of 242, 305 and 2216 respectively.

The scarcity of users of these languages might be what warrants such a high salary. Note that although this visualization is categorized by language, participants are not limited to one language, because the survey question accepts multiple selections. This means that participants who know Python might also know Swift, and vice versa. Bearing this in mind, it would not be wise to prioritize learning one language based solely on this visualization.
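
To make the overlap concrete, the share of one language's users who also selected another language can be read straight off the 0/1 columns. The sketch below uses a made-up `toy` frame, and the resulting figure is illustrative only.

```python
import pandas as pd

# Hypothetical 0/1 flags for five respondents
toy = pd.DataFrame({
    'language_python': [1, 1, 1, 0, 1],
    'language_swift':  [1, 0, 1, 1, 0],
})

# Among Swift users, the mean of the Python flag is the share who also use Python
swift_users = toy[toy['language_swift'] == 1]
share_also_python = swift_users['language_python'].mean()
print(f"{share_also_python:.0%} of Swift users also selected Python")
```

A high overlap would suggest that a language's average salary partly reflects the other languages its users know.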

We can, however, loosely infer that **knowing languages such as Python, R and SQL possibly warrants higher salaries than languages like C, C++, Java and JavaScript in data science applications**.

Turning our focus to the no. of languages used, we can see that there is no clear trend between average salary and no. of languages used. There is a huge spike at 11 languages, but it is probably not worth spending the time to learn 11 languages. Time would probably be better spent honing the few languages that you already know, along with other data analysis skills.

## 2.8 Average Salary vs Data Visualization Library

We will explore data visualization tools in the same format as we did for programming languages: first the average salary for each data visualization library, then whether knowing more data visualization libraries affects salary.

Let's take a look at the graphs.

In [192]:
dataviz_average_salary = {}
dataviz_average_salary['dataviz'] = []
dataviz_average_salary['salary'] = []
dataviz_average_salary['count'] = []


for dataviz in dataviz_cols:
    temp = df[df[dataviz] == 1]
    dataviz_average_salary['dataviz'].append(dataviz)
    dataviz_average_salary['salary'].append(temp.compensation.mean())
    dataviz_average_salary['count'].append(temp.compensation.count())


temp = pd.DataFrame(data=dataviz_average_salary)

px.bar(temp, x='dataviz', y=['salary','count'],barmode='group',title='Average Salary of Each Data Visualization Library')

In [193]:
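# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df.groupby(by='dataviz_sum').mean(numeric_only=True)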

fig = px.bar(temp, x=temp.index, y='compensation',title='No. of Data Visualization Libraries used vs Average Salary')
fig.update_layout(xaxis_title='No. Of DataVis libraries Used', yaxis_title='Average Salary')
fig

When we visualize the average salary for each data visualization library, it tells a similar story to the individual programming languages in the earlier section. The highest paid data visualization libraries also have a relatively small user base.

We will make the same loose assumption that users who know ggplot and plotly have higher salaries than those who use matplotlib and seaborn.

However, when we inspect the average salary associated with no. of data visualization tools used, we notice a few interesting findings:

1. There is a salary difference if you don't know any visualization libraries at all
2. There is a rising salary trend starting from **3 data visualization libraries and above**
3. There is a significant increase in salary for participants knowing **more than 8 data visualization libraries**. 

This is very different from what we saw in the languages section. We will continue to use the same format to explore ML frameworks and learning platforms.
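
Since the same per-tool loop is repeated for every tool family, it could be wrapped in a small helper. The sketch below is one possible shape for it. The function name `average_salary_by_tool` and the `paid_only` flag are hypothetical additions, and the `toy` frame is made up. The flag exists because missing salaries were mapped to 0 during cleaning, and those zeros pull the mean down unless they are excluded.

```python
import pandas as pd

def average_salary_by_tool(df, tool_cols, paid_only=False):
    """One row per tool column: mean salary and number of users."""
    if paid_only:
        # Optionally skip respondents whose salary is 0 (missing in the survey)
        df = df[df['compensation'] > 0]
    rows = []
    for col in tool_cols:
        users = df.loc[df[col] == 1, 'compensation']
        rows.append({'tool': col, 'salary': users.mean(), 'count': users.count()})
    return pd.DataFrame(rows)

# Hypothetical respondents; the last one did not report a salary
toy = pd.DataFrame({
    'dataviz_mpl':    [1, 1, 0, 1],
    'dataviz_plotly': [0, 1, 1, 0],
    'compensation':   [20000, 40000, 60000, 0],
})

all_rows = average_salary_by_tool(toy, ['dataviz_mpl', 'dataviz_plotly'])
paid_rows = average_salary_by_tool(toy, ['dataviz_mpl', 'dataviz_plotly'], paid_only=True)
print(all_rows)
print(paid_rows)
```

The returned frame can be passed to `px.bar` in the same way as the `temp` frames built by the loops in this notebook.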

## 2.9 Average Salary vs ML Framework

Following the same format as the previous 2 sections, let's take a look at the graphs for ML frameworks vs salary.

In [194]:
ml_framework_average_salary = {}
ml_framework_average_salary['ml_framework'] = []
ml_framework_average_salary['salary'] = []
ml_framework_average_salary['count'] = []


for framework in ml_framework_cols:
    temp = df[df[framework] == 1]
    ml_framework_average_salary['ml_framework'].append(framework)
    ml_framework_average_salary['salary'].append(temp.compensation.mean())
    ml_framework_average_salary['count'].append(temp.compensation.count())


temp = pd.DataFrame(data=ml_framework_average_salary)

px.bar(temp, x='ml_framework', y=['salary','count'],barmode='group',title='Average Salary of Each ML Framework')

In [195]:
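# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df.groupby(by='ml_framework_sum').mean(numeric_only=True)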

fig = px.bar(temp, x=temp.index, y='compensation',title='No. of Machine Learning Framework used vs Average Salary')
fig.update_layout(xaxis_title='No. Of Machine Learning Framework Used', yaxis_title='Average Salary')
fig

The story is similar to the data visualization section. We can see that some of the most commonly used frameworks have lower average pay. This is quite possibly because these commonly used frameworks have a large population of users, including both entry-level data workers and veterans. Some frameworks with fewer users have significantly higher pay, but this is not definitive, as previously discussed in the languages section.

However, if there is one thing we can learn from analyzing programming languages, data visualization libraries and ML frameworks, it is that certain frameworks/tools with considerably smaller user bases warrant high salaries. It is quite possible that these users are also well versed in the more common frameworks and tools.

Coming back to ML frameworks, there is an even clearer trend of increasing salary for knowing 1-11 ML frameworks. After 11 frameworks, the trend is not clear.

## 2.10 Learning Platforms

Last but not least, I would like to explore the learning platforms and their association with salary. Let us proceed to the visualizations.

In [196]:
learning_average_salary = {}
learning_average_salary['learning_platform'] = []
learning_average_salary['salary'] = []
learning_average_salary['count'] = []


for platform in learning_cols:
    temp = df[df[platform] == 1]
    learning_average_salary['learning_platform'].append(platform)
    learning_average_salary['salary'].append(temp.compensation.mean())
    learning_average_salary['count'].append(temp.compensation.count())


temp = pd.DataFrame(data=learning_average_salary)

px.bar(temp, x='learning_platform', y=['salary','count'],barmode='group',title='Average Salary of Each Learning Platform')

In [197]:
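# numeric_only=True skips the text columns, which recent pandas versions refuse to average
temp = df.groupby(by='learning_sum').mean(numeric_only=True)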

fig = px.bar(temp, x=temp.index, y='compensation',title='No. of Learning Platform used vs Average Salary')
fig.update_layout(xaxis_title='No. Of Learning Platform Used', yaxis_title='Average Salary')
fig

It can be seen that fast.ai, edX and Udacity learners have higher salaries. This is an interesting finding, because fast.ai and edX are free/semi-free learning platforms, while Udacity is one of the more expensive learning platforms available. They represent 2 extremes in terms of pricing.

From our second visualization, showing the effect of the no. of learning platforms used on salary, there is one especially interesting and encouraging finding: there is a clear difference between users who do learn (any value ≥ 1) and users who do not.
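
The comparison between learners and non-learners can also be made directly by collapsing `learning_sum` into two groups. The sketch below does this on a made-up `toy` frame. The group labels are hypothetical and the numbers are not survey results.

```python
import pandas as pd

# Hypothetical respondents
toy = pd.DataFrame({
    'learning_sum': [0, 0, 1, 2, 5],
    'compensation': [10000, 20000, 30000, 50000, 40000],
})

# Collapse the platform count into "none" vs "at least one"
group = (toy['learning_sum'].ge(1)
         .map({True: 'at least one platform', False: 'no platform'})
         .rename('group'))
comparison = toy.groupby(group)['compensation'].agg(['mean', 'count'])
print(comparison)
```

Showing `count` next to `mean` indicates how many respondents stand behind each average.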

However, more learning platforms are not necessarily more beneficial for increasing your salary. This is natural, as using more learning platforms doesn't mean that you learn more. Learners will probably benefit more from a higher level of focus and dedication on one or two platforms that provide quality learning content than from using more platforms.

# 3.0 Conclusion


The findings from the various visualizations explored in this notebook are summarized in the list below:

1. No obvious wage gap exists due to gender.
2. No particular job position has any gender bias. However, with a lower male-female ratio among students, we should be able to expect this lower ratio to trickle into workplaces as time goes by, narrowing the gender gap.
3. Participants aged 45 and below make up the majority of this survey, and in turn, a significant portion of the data community.
4. There is **an increasing trend of salary from age 22 to 45**.
5. There is a salary gap between the bachelor's degree and the master's degree, and another gap between the master's degree and doctorates. There is no significant difference between the other education levels, from high school all the way up to the bachelor's degree.
6. There is a clear, increasing trend of salary with increasing coding experience.
7. No single language can be confirmed to warrant higher pay, and there is no trend suggesting that a higher salary is associated with knowing more languages.
8. On the other hand, knowing more data visualization libraries does show a clear increasing trend in salary. This trend is obvious for knowing 2-7 data visualization libraries.
9. The same can be said for ML frameworks, with a clear increase in salary from knowing 1-11 ML frameworks.
10. Users who use learning platforms have higher salaries than users who do not. However, there is no correlation between using more learning platforms and salary.


Please leave your comments down below to let me know where this notebook can be improved. Do give me an upvote if you liked my work! Thank you!

Prepared by Edward Goh [Kaggle](https://www.kaggle.com/edwars) [Github](https://github.com/edwarsgts/DataAnalysisPortfolio)

