# Data Science Salaries in 2023


- The aim of this project is to analyse trends in the Data Science industry. Employees can be grouped by job role, experience level, residence, company, mode of work and so on, and the goal is to observe how quantities such as salary vary across these groups, for example with work experience or job role. This gives us insight into the trends in the field of Data Science.

- The dataset has been taken from the following link:
  https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023
  
- For the analysis, I will be using a Google Colab Notebook, with Python as the programming language. The libraries used are the following:

  1. Numpy for array operations.
  2. Pandas for modifying and performing operations on a dataset by converting it into a dataframe.
  3. Matplotlib, plotly and seaborn for plotting visual data.
  4. Country_converter for converting country codes for world map plotting.

  Some other useful libraries are used as well.


1. The standard technique for data analysis is to first clean, edit and rearrange the data. This involves removing rows with null entries or replacing null values with a calculated estimate, removing unnecessary columns, adding extra columns derived from the given columns, splitting or merging the dataset and so on. The goal is to have a dataframe that can be analysed smoothly without any errors.

2. Next, we obtain results by rearranging and performing computations on the dataframe to obtain useful insights.

3. Finally, we represent these insights visually, for example with a line graph to show how one column varies with respect to another, a bar plot to compare the counts in each category, a heat map to locate the peak and low values of a column, and many more.
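
The three steps above can be sketched on a tiny, made-up table. The column names and values below are illustrative assumptions, not rows from the Kaggle dataset, and the plotting step is only indicated in a comment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the real dataset.
raw = pd.DataFrame({
    "role": ["Analyst", "Engineer", "Engineer", "Analyst"],
    "salary": [50000.0, np.nan, 90000.0, 70000.0],
    "notes": ["a", "b", "c", "d"],   # a column we do not need
})

# Step 1: clean. Drop the unneeded column and fill the missing salary
# with the median of the known salaries.
clean = raw.drop(columns=["notes"])
clean["salary"] = clean["salary"].fillna(clean["salary"].median())

# Step 2: compute. Mean salary per role.
mean_by_role = clean.groupby("role")["salary"].mean()

# Step 3: visualise. For example, mean_by_role.plot(kind="bar") with Matplotlib.
print(mean_by_role)
```

Filling with the median is only one possible choice here; dropping the incomplete row would be an equally valid cleaning step, depending on the analysis.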


The course 'Data Analysis with Python: Zero to Pandas' offered by Jovian has helped me learn Data Analysis from scratch, and all of the information I've described above has been learnt from this course.

---------------------

### How to run the code

This is an executable [*Jupyter notebook*](https://jupyter.org) hosted on [Jovian.ml](https://www.jovian.ml), a platform for sharing data science projects. You can run and experiment with the code in a couple of ways: *using free online resources* (recommended) or *on your own computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on [mybinder.org](https://mybinder.org), a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle".


#### Option 2: Running on your computer locally

1. Install Conda by [following these instructions](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). Add Conda binaries to your system `PATH`, so you can use the `conda` command on your terminal.

2. Create a Conda environment and install the required libraries by running these commands on the terminal:

```
conda create -n zerotopandas -y python=3.8 
conda activate zerotopandas
pip install jovian jupyter numpy pandas matplotlib seaborn plotly country_converter wordcloud nltk opendatasets --upgrade
```

3. Press the "Clone" button above to copy the command for downloading the notebook, and run it on the terminal. This will create a new directory and download the notebook. The command will look something like this:

```
jovian clone notebook-owner/notebook-id
```



4. Enter the newly created directory using `cd directory-name` and start the Jupyter notebook.

```
jupyter notebook
```

You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook file (it has a `.ipynb` extension) to open it.


-----------------------

## Downloading the Dataset

We can download the dataset directly with the *opendatasets* library, which takes the URL as an argument and downloads the files from it.

In [111]:
!pip install jovian opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [112]:
# Change this
dataset_url = 'https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023' 

In [113]:
import opendatasets as od
od.download(dataset_url)

Skipping, found downloaded files in "./data-science-salaries-2023" (use force=True to force download)


The dataset has been downloaded and extracted.

In [114]:
# Change this
data_dir = './data-science-salaries-2023'

In [115]:
import os
os.listdir(data_dir)

['ds_salaries.csv']

Let us save and upload our work to Jovian before continuing.

In [116]:
project_name = "data-science-jobs" # change this (use lowercase letters and hyphens only)

In [117]:
!pip install jovian --upgrade -q

In [118]:
import jovian

In [119]:
jovian.commit(project=project_name)

[jovian] Detected Colab notebook...
[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian, 
then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian. 
Also, you can also delete this cell, it's no longer necessary.


-----------------

## Data Preparation and Cleaning





In [120]:
import numpy as np
import pandas as pd

### Brief description of the dataframe:
1. 'work_year': The year in which the employees' salary has been recorded.

2. 'experience_level': \
SE - Senior \
EN - Entry level \
EX - Executive level \
MI - Mid/Intermediate level

3. 'employment_type': \
FL - Freelancer \
CT - Contractor \
FT - Full-time \
PT - Part-time

In [121]:
data_df = pd.read_csv(data_dir + '/ds_salaries.csv')
data_df

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3750 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
| 3751 | 2021 | MI | FT | Principal Data Scientist | 151000 | USD | 151000 | US | 100 | US | L |
| 3752 | 2020 | EN | FT | Data Scientist | 105000 | USD | 105000 | US | 100 | US | S |
| 3753 | 2020 | EN | CT | Business Data Analyst | 100000 | USD | 100000 | US | 100 | US | L |


- The distinct values in each categorical column are listed below.

In [122]:
data_df['experience_level'].unique()

array(['SE', 'MI', 'EN', 'EX'], dtype=object)

In [123]:
data_df['employment_type'].unique()

array(['FT', 'CT', 'FL', 'PT'], dtype=object)

In [124]:
data_df['job_title'].unique()

array(['Principal Data Scientist', 'ML Engineer', 'Data Scientist',
       'Applied Scientist', 'Data Analyst', 'Data Modeler',
       'Research Engineer', 'Analytics Engineer',
       'Business Intelligence Engineer', 'Machine Learning Engineer',
       'Data Strategist', 'Data Engineer', 'Computer Vision Engineer',
       'Data Quality Analyst', 'Compliance Data Analyst',
       'Data Architect', 'Applied Machine Learning Engineer',
       'AI Developer', 'Research Scientist', 'Data Analytics Manager',
       'Business Data Analyst', 'Applied Data Scientist',
       'Staff Data Analyst', 'ETL Engineer', 'Data DevOps Engineer',
       'Head of Data', 'Data Science Manager', 'Data Manager',
       'Machine Learning Researcher', 'Big Data Engineer',
       'Data Specialist', 'Lead Data Analyst', 'BI Data Engineer',
       'Director of Data Science', 'Machine Learning Scientist',
       'MLOps Engineer', 'AI Scientist', 'Autonomous Vehicle Technician',
       'Applied Machine Learning Sc

In [125]:
data_df['salary_currency'].unique()

array(['EUR', 'USD', 'INR', 'HKD', 'CHF', 'GBP', 'AUD', 'SGD', 'CAD',
       'ILS', 'BRL', 'THB', 'PLN', 'HUF', 'CZK', 'DKK', 'JPY', 'MXN',
       'TRY', 'CLP'], dtype=object)

In [126]:
data_df['employee_residence'].unique()

array(['ES', 'US', 'CA', 'DE', 'GB', 'NG', 'IN', 'HK', 'PT', 'NL', 'CH',
       'CF', 'FR', 'AU', 'FI', 'UA', 'IE', 'IL', 'GH', 'AT', 'CO', 'SG',
       'SE', 'SI', 'MX', 'UZ', 'BR', 'TH', 'HR', 'PL', 'KW', 'VN', 'CY',
       'AR', 'AM', 'BA', 'KE', 'GR', 'MK', 'LV', 'RO', 'PK', 'IT', 'MA',
       'LT', 'BE', 'AS', 'IR', 'HU', 'SK', 'CN', 'CZ', 'CR', 'TR', 'CL',
       'PR', 'DK', 'BO', 'PH', 'DO', 'EG', 'ID', 'AE', 'MY', 'JP', 'EE',
       'HN', 'TN', 'RU', 'DZ', 'IQ', 'BG', 'JE', 'RS', 'NZ', 'MD', 'LU',
       'MT'], dtype=object)

In [127]:
data_df['company_location'].unique()

array(['ES', 'US', 'CA', 'DE', 'GB', 'NG', 'IN', 'HK', 'NL', 'CH', 'CF',
       'FR', 'FI', 'UA', 'IE', 'IL', 'GH', 'CO', 'SG', 'AU', 'SE', 'SI',
       'MX', 'BR', 'PT', 'RU', 'TH', 'HR', 'VN', 'EE', 'AM', 'BA', 'KE',
       'GR', 'MK', 'LV', 'RO', 'PK', 'IT', 'MA', 'PL', 'AL', 'AR', 'LT',
       'AS', 'CR', 'IR', 'BS', 'HU', 'AT', 'SK', 'CZ', 'TR', 'PR', 'DK',
       'BO', 'PH', 'BE', 'ID', 'EG', 'AE', 'LU', 'MY', 'HN', 'JP', 'DZ',
       'IQ', 'CN', 'NZ', 'CL', 'MD', 'MT'], dtype=object)

In [128]:
data_df['company_size'].unique()

array(['L', 'S', 'M'], dtype=object)

In [129]:
data_df.shape

(3755, 11)

In [130]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


- Every column has 3755 non-null values, so the dataframe contains no null entries and no rows need to be removed or filled in.
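
One way to confirm this without reading the `info()` listing line by line is to count the missing values per column. The sketch below uses a small made-up frame (hypothetical values, with one gap added on purpose) in place of `data_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_df, with one missing value on purpose.
df = pd.DataFrame({
    "work_year": [2022, 2023, 2023],
    "salary_in_usd": [100000.0, np.nan, 120000.0],
})

missing_per_column = df.isna().sum()        # number of nulls in each column
has_missing = bool(df.isna().any().any())   # True if any cell is null

print(missing_per_column)
print("Any missing values:", has_missing)
```

Run against the real `data_df`, every count should be 0 if the `info()` output above is accurate.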

In [131]:
data_df.describe()

| | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 3755.0 | 3755.0 | 3755.0 | 3755.0 |
| mean | 2022.373635 | 190695.6 | 137570.38988 | 46.271638 |
| std | 0.691448 | 671676.5 | 63055.625278 | 48.58905 |
| min | 2020.0 | 6000.0 | 5132.0 | 0.0 |
| 25% | 2022.0 | 100000.0 | 95000.0 | 0.0 |
| 50% | 2022.0 | 138000.0 | 135000.0 | 0.0 |
| 75% | 2023.0 | 180000.0 | 175000.0 | 100.0 |
| max | 2023.0 | 30400000.0 | 450000.0 | 100.0 |


- Dropping the columns 'salary' and 'salary_currency', which give each employee's salary in the currency in which it is paid, since we already have the salaries in USD in 'salary_in_usd'.

In [132]:
data_df.drop(['salary', 'salary_currency'], axis = 1, inplace = True)
data_df

| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | CA | 100 | CA | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3750 | 2020 | SE | FT | Data Scientist | 412000 | US | 100 | US | L |
| 3751 | 2021 | MI | FT | Principal Data Scientist | 151000 | US | 100 | US | L |
| 3752 | 2020 | EN | FT | Data Scientist | 105000 | US | 100 | US | S |
| 3753 | 2020 | EN | CT | Business Data Analyst | 100000 | US | 100 | US | L |


- The salary is expressed in thousands of USD to avoid large values. Accordingly, the column name has been changed to 'salary_in_usd_k'.
- The remote ratio percentage has been converted to a value between 0 and 1. A remote ratio of 1 means the employee always works away from the main office, usually at home or at another work site; 0.5 means partially remote and 0 means no remote work.

In [133]:
data_df['salary_in_usd'] = data_df['salary_in_usd']/1000
data_df.rename(columns = {'salary_in_usd': 'salary_in_usd_k'}, inplace = True)
data_df['remote_ratio'] = data_df['remote_ratio']/100

data_df

| | work_year | experience_level | employment_type | job_title | salary_in_usd_k | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 85.847 | ES | 1.0 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30.000 | US | 1.0 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25.500 | US | 1.0 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175.000 | CA | 1.0 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120.000 | CA | 1.0 | CA | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3750 | 2020 | SE | FT | Data Scientist | 412.000 | US | 1.0 | US | L |
| 3751 | 2021 | MI | FT | Principal Data Scientist | 151.000 | US | 1.0 | US | L |
| 3752 | 2020 | EN | FT | Data Scientist | 105.000 | US | 1.0 | US | S |
| 3753 | 2020 | EN | CT | Business Data Analyst | 100.000 | US | 1.0 | US | L |


- The dataframe is now ready for visualisation and analysis.
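
The preparation steps in this section can also be written as one function that returns a new frame instead of editing `data_df` in place, which makes the cell safe to re-run. This is a minimal sketch: `prepare` is a hypothetical helper name, and the sample holds only two rows copied from the table printed earlier.

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    return (
        df.drop(columns=["salary", "salary_currency"])
          .assign(
              salary_in_usd_k=lambda d: d["salary_in_usd"] / 1000,
              remote_ratio=lambda d: d["remote_ratio"] / 100,
          )
          .drop(columns=["salary_in_usd"])
    )

# Two rows copied from the table printed earlier in this section.
sample = pd.DataFrame({
    "salary": [80000, 30000],
    "salary_currency": ["EUR", "USD"],
    "salary_in_usd": [85847, 30000],
    "remote_ratio": [100, 100],
})
prepared = prepare(sample)
print(prepared)
```

Because nothing is modified in place, running the function twice on the original frame gives the same result, whereas re-running the in-place cells would divide the salary by 1000 a second time.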

In [134]:
import jovian

In [135]:
jovian.commit()



--------------------

## Exploratory Analysis and Visualization

In this section, we're going to visualise the dataframe. The aim is to observe the trends in working year, company, role, workplace and so on. The analysis can be done either for each individual column, or by grouping two or more columns and observing how they vary together.



In [136]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')
import nltk

%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [137]:
!pip install country_converter

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [138]:
import country_converter as coco

--------------

### 1. Experience Level
The experience classification of employees in the given dataframe.


- Here, we replace the codes in the 'experience_level' column with their full names for clarity.
- Then, we plot a treemap using plotly.express.

In [139]:
data_df['experience_level'] = data_df['experience_level'].replace('EN','Entry-level/Junior')
data_df['experience_level'] = data_df['experience_level'].replace('MI','Mid-level/Intermediate')
data_df['experience_level'] = data_df['experience_level'].replace('SE','Senior-level/Expert')
data_df['experience_level'] = data_df['experience_level'].replace('EX','Executive-level/Director')

ex_level = data_df['experience_level'].value_counts()
fig = px.treemap(ex_level, path = [ex_level.index], values = ex_level.values, 
                title = 'Experience Level')
fig.show()

- We can see that most of the employees are at the Senior-level/Expert level.
- This could mean that the skill requirement for Data Science jobs is fairly high.
- On the other hand, Executive-level/Director employees are the fewest. This is expected, since an organisation typically has only a few managers or directors.
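
The renaming step in this section can be done in a single call by passing a dictionary to `replace`. The sketch below assumes the same four codes and labels used in this notebook; the short series of codes is a made-up sample standing in for `data_df['experience_level']`.

```python
import pandas as pd

# Codes and labels as used in this notebook.
LEVEL_NAMES = {
    "EN": "Entry-level/Junior",
    "MI": "Mid-level/Intermediate",
    "SE": "Senior-level/Expert",
    "EX": "Executive-level/Director",
}

# Hypothetical sample of codes standing in for data_df['experience_level'].
codes = pd.Series(["SE", "MI", "SE", "EN", "EX", "SE"])
levels = codes.replace(LEVEL_NAMES)   # one pass instead of four
counts = levels.value_counts()
print(counts)
```

Keeping the mapping in one dictionary also makes it easy to reuse the same labels in later plots.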

----------------

### 2. Job Designation
Number of employees in each job category.

- Since there are many jobs in the dataframe, we're going to consider the top 15 job titles for representation by selecting them through *value_counts()*.
- We then plot a bar graph of the **number of employees** against the **job titles**.
- Axis labels and a title are added.

In [140]:
top15_job_titles = data_df['job_title'].value_counts()[:15]
fig = px.bar(y = top15_job_titles.values, x = top15_job_titles.index, 
            text = top15_job_titles.values, title = 'Top 15 Job Designations')
fig.update_layout(xaxis_title = "Job Designations", yaxis_title = "Count")
fig.show()

- We can see that out of all the job titles, **Data Engineer**, **Data Scientist** and **Data Analyst** are the most common, in that order, followed by Machine Learning Engineer, Analytics Engineer and so on. Anyone looking for opportunities in the field of Data Science should consider these areas.
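
Raw counts show which titles are most common, but not how much of the dataset they cover. A possible extension, sketched here on made-up titles rather than the real `job_title` column, is to report each title's share next to its count using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical titles standing in for data_df['job_title'].
titles = pd.Series(
    ["Data Engineer"] * 5 + ["Data Scientist"] * 3 + ["Data Analyst"] * 2
)

top_n = 2
counts = titles.value_counts().head(top_n)                 # raw counts
shares = titles.value_counts(normalize=True).head(top_n)   # fraction of all rows

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```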

---------------------

### 3. Employment Type
Distribution of employees by whether they work Full-Time, Part-Time, on Contract or as Freelancers.

- The frequencies of the four categories are taken in descending order using *value_counts()*
- Then, we plot a bar graph of the frequency with the necessary labels.

In [141]:
group = data_df['employment_type'].value_counts()
# Build the labels from the codes in the index, so each label matches its count
emp_labels = {'FT': 'Full-Time', 'PT': 'Part-Time', 'CT': 'Contract', 'FL': 'Freelance'}
emp_type = group.index.map(emp_labels)

fig = px.bar(x = emp_type, y = group.values, 
       color = group.index, text = group.values, 
       title = 'Employment Type Distribution')

fig.update_layout( xaxis_title = "Employment Type", yaxis_title = "count")
fig.show()

- Most of the employees are involved in a Full-Time job, while only a minority are in Part-Time, Contract and Freelance.
- This could mean that Data Science jobs are stable, and employees tend to stay longer with the company.

----------------------------

### 4. Employee Residence 
Location of employees by country.

- Here, we use the *country_converter* library to convert the two-letter country codes in the 'employee_residence' column to three-letter ISO3 codes, which the map plot can place on the world map.
- Then, using the plotly.express module, we make a world map graph, with the intensity of green denoting the number of employees in that country.

In [142]:
country = coco.convert(names = data_df['employee_residence'], to = "ISO3")
data_df['employee_residence'] = country

In [143]:
residence = data_df['employee_residence'].value_counts()
fig = px.choropleth(locations = residence.index,
                    color = residence.values,
                    color_continuous_scale=px.colors.sequential.YlGn,
                    title = 'Employee Location On Map')
fig.show()

- Since most of the data describes employees in the USA, it has the most intense green color, while other countries have a comparatively lower intensity.

--------------------------

### 5. Remote Ratio
Number of employees working Fully Remote, Partially Remote or No Remote work mode.

- We plot a bar graph representing the frequencies of the three categories mentioned.

In [144]:
remote_counts = data_df['remote_ratio'].value_counts()
# Label each bar from its remote_ratio value, so the names cannot drift from the counts
remote_labels = {1.0: 'Fully Remote', 0.5: 'Partially Remote', 0.0: 'No Remote Work'}
remote_type = remote_counts.index.map(remote_labels)

fig = px.bar(x = remote_type, y = remote_counts.values,
       color = remote_type, text = remote_counts.values,
       title = 'Remote Ratio Distribution')

fig.update_layout( xaxis_title = "Remote Type", yaxis_title = "count")
fig.show()

- The summary statistics from `data_df.describe()` give a median 'remote_ratio' of 0 and a mean of about 46%, so at least half of the employees do no remote work. Fully remote employees form the next-largest group, and only a small share are partially remote.
- Remote work is therefore common in Data Science jobs, with a large share of employees able to work entirely from home, but it is not the majority mode of work in this dataset.

Let us save and upload our work to Jovian before continuing.

In [145]:
import jovian

In [146]:
jovian.commit()



-------------------------------

## Asking and Answering Questions

Let's look at some questions related to the data.



#### Q1: What is the average salary based on the work year?


- We group the dataframe by 'work_year' and take the mean salary for each of the 4 years.
- Then, we plot the results using a bar graph.

In [147]:
# Mean salary (in thousands of USD) for each work year, indexed by year
year_salary = data_df.groupby('work_year')['salary_in_usd_k'].mean()
years = year_salary.index.astype(str)   # strings, so each year gets its own colour

fig = px.bar(x = years, y = year_salary.values,
             color = years,
             title = 'Mean Salary by Work Year')

fig.update_layout(xaxis_title = "Work Year", yaxis_title = "Mean Salary (k)")
fig.show()

- We can see consistent growth in the average salary from 2020 to 2023. This may indicate growing demand and opportunities in Data Science jobs.
- The growth appears minimal from 2020 to 2021, possibly due to the economic disruption caused by COVID-19.
- Salaries increased significantly in 2022 and 2023, which may reflect the economy recovering from that disruption.
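
To put a number on words like "minimal" and "significant", the yearly means can be turned into year-over-year percentage changes with `pct_change`. The figures below are made up for illustration and are not the dataset's actual means.

```python
import pandas as pd

# Hypothetical salaries in thousands of USD; not the real dataset.
df = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022],
    "salary_in_usd_k": [90.0, 110.0, 100.0, 120.0, 130.0, 134.0],
})

mean_by_year = df.groupby("work_year")["salary_in_usd_k"].mean()
# Percentage change relative to the previous year (NaN for the first year).
yoy_pct = mean_by_year.pct_change() * 100

print(pd.DataFrame({"mean_k": mean_by_year, "yoy_pct": yoy_pct.round(1)}))
```

Note that each year's mean is taken over a different set of respondents, so a change in the mean can also reflect a change in who was surveyed.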

------------------------

#### Q2: What is the average salary based on the experience of the employee?

- Similar to the above case, we create 4 new dataframes containing each individual experience level.
- Then, we take the average of them and plot on a bar graph.

In [148]:
exp_salary = data_df[['experience_level','salary_in_usd_k']]

entry_salary = exp_salary.loc[exp_salary['experience_level'] == 'Entry-level/Junior']
executive_salary = exp_salary.loc[exp_salary['experience_level'] == 'Executive-level/Director']
mid_salary = exp_salary.loc[exp_salary['experience_level'] == 'Mid-level/Intermediate']
senior_salary = exp_salary.loc[exp_salary['experience_level'] == 'Senior-level/Expert']


group_labels = ['Entry-level/Junior', 'Mid-level/Intermediate', 'Senior-level/Expert', 'Executive-level/Director']

means = [entry_salary['salary_in_usd_k'].mean(), mid_salary['salary_in_usd_k'].mean(),
    senior_salary['salary_in_usd_k'].mean(), executive_salary['salary_in_usd_k'].mean(),]


fig = go.Figure(data=px.bar(x = group_labels, y = means, color = group_labels,
                            title = 'Mean Salary by Experience Level'))
                            

fig.update_layout(xaxis_title = "Experience Level", yaxis_title = "Mean Salary (k) ")

fig.show()

- We can observe that employees in the Executive-level/Director category have the highest mean salary. The mean salary also rises with each experience level, meaning that, as in most other industries, salary tends to increase with experience.
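
The same means can be obtained with a single `groupby`, without building one dataframe per level. A plain `groupby` sorts the levels alphabetically, so this sketch first converts the column to an ordered categorical to keep them in career order. The rows are hypothetical.

```python
import pandas as pd

ORDER = ["Entry-level/Junior", "Mid-level/Intermediate",
         "Senior-level/Expert", "Executive-level/Director"]

# Hypothetical rows; salaries in thousands of USD.
df = pd.DataFrame({
    "experience_level": ["Senior-level/Expert", "Entry-level/Junior",
                         "Executive-level/Director", "Mid-level/Intermediate",
                         "Entry-level/Junior"],
    "salary_in_usd_k": [150.0, 70.0, 200.0, 100.0, 90.0],
})

# An ordered categorical makes groupby return levels in career order
# rather than alphabetical order.
df["experience_level"] = pd.Categorical(
    df["experience_level"], categories=ORDER, ordered=True
)
means = df.groupby("experience_level", observed=True)["salary_in_usd_k"].mean()
print(means)
```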

------------------------------

#### Q3: How does the salary of an employee vary with respect to the size of the company they work at?

- We again plot a bar graph, this time showing the mean salaries of employees working at small, mid-sized and large companies.

In [149]:
company_size = data_df[['company_size','salary_in_usd_k']]
# Filter the company_size frame itself rather than exp_salary from the previous cell
small = company_size.loc[company_size['company_size'] == 'S']
mid = company_size.loc[company_size['company_size'] == 'M']
large = company_size.loc[company_size['company_size'] == 'L']

group_labels = ['Company Size: Small', 'Company Size: Mid', 'Company Size: Large']

means = [small['salary_in_usd_k'].mean(), mid['salary_in_usd_k'].mean(), large['salary_in_usd_k'].mean()]


fig = go.Figure(data = px.bar(x = group_labels, y = means, color = group_labels,
                title = 'Mean Salary by Company Size'))


fig.update_layout( xaxis_title = "Company Size", yaxis_title = "Mean Salary (k)")

fig.show()

- Interestingly, employees working at mid-sized companies receive a higher salary on average than those working at larger companies.
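
A mean can be pulled up by a few very large salaries, and the three groups may differ a lot in size, so it can help to look at the median and the row count next to the mean before reading much into this result. The sketch below uses made-up rows, not the real data.

```python
import pandas as pd

# Hypothetical rows; salaries in thousands of USD.
df = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "L", "L"],
    "salary_in_usd_k": [60.0, 80.0, 120.0, 140.0, 160.0, 100.0, 300.0],
})

by_size = (
    df.groupby("company_size")["salary_in_usd_k"]
      .agg(["mean", "median", "count"])
      .reindex(["S", "M", "L"])   # small, medium, large
)
print(by_size)
```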

----------------------------------

#### Q4: What are the highest salaries received by the employees?

- Here, we take the 25 highest distinct combinations of salary and job title (employees with the same title and the same salary are counted once) and represent them on a bar graph, stacking the bars when a job title appears more than once.

In [150]:
salary_designation = data_df.groupby(['salary_in_usd_k', 'job_title']).size().reset_index()
salary_designation = salary_designation[-25:]
fig = px.bar(x = salary_designation['job_title'], y = salary_designation['salary_in_usd_k'],
            text = salary_designation['salary_in_usd_k'], color = salary_designation['salary_in_usd_k'])

fig.update_layout(xaxis_title = "Job Designation", yaxis_title = "Salaries in k",
                  xaxis_tickangle = -45,
                  title = 'Top 25 Highest Salary by Designation')
fig.show()

- It seems that **Research Scientist** and **Data Scientist** account for a good number of the highest salaries, followed by Machine Learning Engineer, Director of Data Science, Data Architect and Data Analyst.
- Research Scientists are few in number compared to the most common job titles, yet they appear often among the highest salaries. This may mean that Research Scientist positions are harder to obtain but better paid.
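
Grouping by salary and job title keeps one row per distinct pair, which is not the same as taking the highest-paid employees. The sketch below, on made-up rows, contrasts the two selections; `nlargest` is used here as an alternative and is not the method of the cell above.

```python
import pandas as pd

# Hypothetical rows; salaries in thousands of USD.
df = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Research Scientist",
                  "Data Analyst", "Data Engineer"],
    "salary_in_usd_k": [412.0, 412.0, 450.0, 430.0, 300.0],
})

n = 4
# Top n rows (employees); equal salaries are kept as separate rows.
top_rows = df.nlargest(n, "salary_in_usd_k")

# Top n distinct (salary, title) pairs, as the groupby approach yields.
top_pairs = (
    df.drop_duplicates(["salary_in_usd_k", "job_title"])
      .nlargest(n, "salary_in_usd_k")
)
print(top_rows)
print(top_pairs)
```

In this toy example the two Data Scientists on 412k both appear in `top_rows`, while `top_pairs` counts them once and reaches further down the list.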

------------------------------

#### Q5: What is the average composition of companies based on the experience of their employees?

- We first group data based on the size of the companies, and represent the number of employees in each experience category on a bar graph.

In [151]:
exp_size = data_df.groupby(['experience_level','company_size']).size()
fig = go.Figure(data = [
    go.Bar(name = 'Entry-level/Junior', x = exp_size['Entry-level/Junior'].index,
           y = exp_size['Entry-level/Junior'].values, text = exp_size['Entry-level/Junior'].values),
    go.Bar(name = 'Executive-level/Director', x = exp_size['Executive-level/Director'].index,
           y = exp_size['Executive-level/Director'].values, text = exp_size['Executive-level/Director'].values),
    go.Bar(name = 'Mid-level/Intermediate', x = exp_size['Mid-level/Intermediate'].index,
           y = exp_size['Mid-level/Intermediate'].values, text = exp_size['Mid-level/Intermediate'].values),
    go.Bar(name = 'Senior-level/Expert', x = exp_size['Senior-level/Expert'].index,
           y = exp_size['Senior-level/Expert'].values, text = exp_size['Senior-level/Expert'].values),
])
fig.update_layout(xaxis_tickangle = -45, title = 'Experience Level with Company Size')

fig.show()

- In large companies, senior employees are the largest group, followed by intermediate and junior employees.
- Medium companies show the same ordering: most employees are seniors, followed by intermediate and junior employees.
- In small companies, the numbers of seniors, intermediates and juniors are almost equal.
- In all 3 cases, Directors/Executives are the smallest group, since they mostly lead the companies and are fewer in number.
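
Because the three company sizes have very different head counts, comparing proportions rather than raw counts may answer the composition question more directly. A minimal sketch with `pd.crosstab` follows, using hypothetical rows and shortened level names.

```python
import pandas as pd

# Hypothetical rows standing in for data_df.
df = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "M", "L", "L"],
    "experience_level": ["Entry", "Senior", "Senior", "Senior",
                         "Mid", "Entry", "Senior", "Mid"],
})

# normalize="index" turns each row (one company size) into fractions
# that sum to 1, so sizes with different head counts can be compared.
composition = pd.crosstab(
    df["company_size"], df["experience_level"], normalize="index"
)
print(composition.round(2))
```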

Let us save and upload our work to Jovian before continuing.

In [152]:
import jovian

In [153]:
jovian.commit()



----------------------------------

## Inferences and Conclusion

- From the above analysis, it appears that the Data Science industry is growing and has good prospects for the future. I believe this project gives an insight into the positive trends of the Data Science domain.

In [154]:
import jovian

In [155]:
jovian.commit()



--------------------------------

## References and Future Work

- This dataset could be extended by merging it with a dataset covering jobs across the world, which may give further insights into aspects such as work hours and job satisfaction.

- Resources that were found to be helpful while making this project:
  1. NumPy documentation: https://numpy.org/doc/
  2. Pandas documentation: https://pandas.pydata.org/docs/
  3. Matplotlib documentation: https://matplotlib.org/stable/index.html
  4. Seaborn documentation: https://seaborn.pydata.org/
  5. Plotly documentation: https://plotly.com/python/

In [156]:
import jovian

In [157]:
jovian.commit()

