In [2]:
import pandas as pd

In [3]:
df_jobs = pd.read_csv('../imgs/ds_salaries.csv', sep=',')
df_jobs.head(2)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S


#### `.groupby()`
`.groupby()` is a method that groups rows according to the values of one or more columns. In the code below, we group the data by the *'job_title'* column and call `.count()`, which returns, for every remaining column, the number of non-null values in each group. Since those columns all carry the same count per job title here, we keep only *'work_year'* and rename it to *'Total'*.

In [4]:
df_jobs.groupby('job_title').count().loc[:,['work_year']].rename(columns={'work_year': 'Total'}).sort_values('Total', ascending=False).head(10)


job_title,Total
Data Scientist,143
Data Engineer,132
Data Analyst,97
Machine Learning Engineer,41
Research Scientist,16
Data Science Manager,12
Data Architect,11
Big Data Engineer,8
Machine Learning Scientist,8
Director of Data Science,7
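
The count-then-rename pattern above can also be written with `.size()`, which counts rows per group directly. The following is a minimal sketch on a small, made-up DataFrame (`toy` and its values are hypothetical and are not taken from `ds_salaries.csv`); it compares both approaches.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "job_title": [
        "Data Scientist", "Data Engineer", "Data Scientist",
        "Data Analyst", "Data Scientist", "Data Engineer",
    ],
    "work_year": [2020, 2021, 2021, 2022, 2022, 2022],
})

# Same pattern as the cell above: count non-null values, keep one column, rename it
via_count = (
    toy.groupby("job_title")
    .count()
    .loc[:, ["work_year"]]
    .rename(columns={"work_year": "Total"})
    .sort_values("Total", ascending=False)
)

# Alternative: .size() returns the number of rows in each group
via_size = (
    toy.groupby("job_title")
    .size()
    .to_frame("Total")
    .sort_values("Total", ascending=False)
)

print(via_size)
```

One difference to keep in mind: `.count()` skips missing values column by column, while `.size()` counts every row in the group, so the two only agree when the kept column has no NaN values.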


If the grouped values are numeric, statistical methods such as `.sum()`, `.mean()`, `.std()`, `.min()` and `.max()` can be used to get some quick insights.
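
As a small illustration of these methods, the sketch below applies each of them to one grouped numeric column. The DataFrame `toy` and its salaries are made up for the example and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from ds_salaries.csv
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Engineer"],
    "salary_in_usd": [100000, 140000, 90000, 110000],
})

# Select the numeric column after grouping so each method returns one Series
grouped = toy.groupby("job_title")["salary_in_usd"]

# Collect several statistics side by side, one row per job title
summary = pd.DataFrame({
    "sum": grouped.sum(),
    "mean": grouped.mean(),
    "std": grouped.std(),   # sample standard deviation (ddof=1 by default)
    "min": grouped.min(),
    "max": grouped.max(),
})

print(summary)
```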

In the code block below, we select the two columns of interest, *'job_title'* and *'salary_in_usd'*, with the `.loc[]` indexer and then group by *'job_title'*. The result shows which titles have the highest mean salary in the dataset, although it is not broken down by year or other comparable categories.

In [28]:
df_jobs.loc[:, ['job_title','salary_in_usd']].groupby('job_title').mean().sort_values('salary_in_usd', ascending=False).head(10)

job_title,salary_in_usd
Data Analytics Lead,405000.0
Principal Data Engineer,328333.333333
Financial Data Analyst,275000.0
Principal Data Scientist,215242.428571
Director of Data Science,195074.0
Data Architect,177873.909091
Applied Data Scientist,175655.0
Analytics Engineer,175000.0
Data Specialist,165000.0
Head of Data,160162.6
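
The ranking above mixes all years together. One possible way to break it down, sketched here on made-up data (the `toy` DataFrame and its values are hypothetical), is to pass a list of columns to `.groupby()` so that the mean is computed per year and per title.

```python
import pandas as pd

# Hypothetical toy data, not taken from ds_salaries.csv
toy = pd.DataFrame({
    "work_year": [2021, 2021, 2022, 2022, 2022],
    "job_title": [
        "Data Scientist", "Data Engineer",
        "Data Scientist", "Data Scientist", "Data Engineer",
    ],
    "salary_in_usd": [100000, 90000, 120000, 140000, 110000],
})

# Grouping by two keys yields a Series indexed by (work_year, job_title)
by_year_title = (
    toy.groupby(["work_year", "job_title"])["salary_in_usd"]
    .mean()
    .sort_values(ascending=False)
)

print(by_year_title)
```

The result has a two-level index, so a single entry can be read with a tuple, for example `by_year_title.loc[(2022, "Data Scientist")]`.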


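The `.agg()` method applies one or more aggregation functions to each selected column of the grouped DataFrame. Below, it computes the minimum and maximum *'salary_in_usd'* for each job title.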

In [29]:
df_jobs.loc[:,['job_title', 'salary_in_usd']].groupby('job_title').agg(['min', 'max']).head(10)

,salary_in_usd,salary_in_usd
job_title,min,max
3D Computer Vision Researcher,5409,5409
AI Scientist,12000,200000
Analytics Engineer,135000,205300
Applied Data Scientist,54238,380000
Applied Machine Learning Scientist,31875,423000
BI Data Analyst,9272,150000
Big Data Architect,99703,99703
Big Data Engineer,5882,114047
Business Data Analyst,18442,135000
Cloud Data Engineer,89294,160000
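
Passing a list of functions to `.agg()` produces two-level column headers, as in the output above. If flat column names are preferred, pandas also supports named aggregation, where each output column is declared as `new_name=(source_column, function)`. The sketch below uses made-up data; `toy`, `min_salary` and `max_salary` are names chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data, not taken from ds_salaries.csv
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Engineer"],
    "salary_in_usd": [100000, 140000, 90000, 110000],
})

# Named aggregation: one flat output column per keyword argument
salary_range = toy.groupby("job_title").agg(
    min_salary=("salary_in_usd", "min"),
    max_salary=("salary_in_usd", "max"),
)

print(salary_range)
```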


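The code block below shows a roughly calculated hourly salary in COP. A lambda function passed to `.agg()` multiplies the mean salary of each job title by the factor 4500/8280 and rounds it to one decimal, and the resulting column is then renamed.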

In [49]:
df_jobs.loc[:,['job_title', 'salary_in_usd']].groupby('job_title').mean().agg({'salary_in_usd': lambda x: round(x*(4500/8280), 1)}).rename(columns={'salary_in_usd' : 'hourly_salary_cop'}).sort_values('hourly_salary_cop', ascending=False).head(5)

job_title,hourly_salary_cop
Data Analytics Lead,220108.7
Principal Data Engineer,178442.0
Financial Data Analyst,149456.5
Principal Data Scientist,116979.6
Director of Data Science,106018.5
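
Because the lambda above rescales every value rather than reducing a group to a single number, the same result can probably be obtained with plain vectorized arithmetic instead of `.agg()`. This is a minimal sketch on made-up data: `toy` is hypothetical, and `FACTOR` simply reuses the 4500/8280 ratio from the cell above (its reading as an exchange rate divided by a number of hours is an assumption).

```python
import pandas as pd

# Ratio copied from the cell above; its interpretation is an assumption
FACTOR = 4500 / 8280

# Hypothetical toy data, not taken from ds_salaries.csv
toy = pd.DataFrame({
    "job_title": ["Data Analytics Lead", "Data Scientist", "Data Scientist"],
    "salary_in_usd": [405000, 100000, 140000],
})

# Mean salary per title, as a Series indexed by job_title
mean_usd = toy.groupby("job_title")["salary_in_usd"].mean()

# Scale, round, rename and sort without going through .agg()
hourly_cop = (
    (mean_usd * FACTOR)
    .round(1)
    .rename("hourly_salary_cop")
    .sort_values(ascending=False)
    .to_frame()
)

print(hourly_cop)
```

Keeping the conversion factor in a named constant makes it easier to update when the exchange rate or the assumed number of hours changes.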
