In [18]:
import pandas as pd

In [19]:
df_salaries = pd.read_csv('./imgs/ds_salaries.csv', sep=',')
df_salaries.head(2)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S


The `.info()` method prints a concise summary of our DataFrame (it returns `None`), which includes:

* The total number of rows in the DataFrame
* The column names
* The count of non-null values in each column
* The data type of each column (dtype)
* The memory usage of the entire DataFrame

In [20]:
df_salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
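
Because `.info()` prints its summary instead of returning it, assigning its result to a variable gives `None`. The sketch below illustrates this and shows one way to capture the text through the `buf` parameter. The small DataFrame is hypothetical toy data, not the salaries file.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the salaries table.
df = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Analyst", None],
    "salary_in_usd": [79833, 150000, 200000],
})

# info() writes the summary to stdout and returns nothing.
result = df.info()
print(result is None)

# To keep the summary as a string, pass a writable buffer via `buf`.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The captured string can then be logged or written to a file alongside other diagnostics.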


`.describe()` shows a quick statistical summary of the numeric columns in our data:

In [21]:
df_salaries.describe()

Unnamed: 0.1,Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
count,607.0,607.0,607.0,607.0,607.0
mean,303.0,2021.405272,324000.1,112297.869852,70.92257
std,175.370085,0.692133,1544357.0,70957.259411,40.70913
min,0.0,2020.0,4000.0,2859.0,0.0
25%,151.5,2021.0,70000.0,62726.0,50.0
50%,303.0,2022.0,115000.0,101570.0,100.0
75%,454.5,2022.0,165000.0,150000.0,100.0
max,606.0,2022.0,30400000.0,600000.0,100.0
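
By default `.describe()` only covers numeric columns, which is why text columns such as `job_title` are absent from the table above. As a minimal sketch on hypothetical toy data, passing `include="all"` adds the text columns, reporting `unique`, `top` and `freq` for them:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the salaries dataset.
df = pd.DataFrame({
    "experience_level": ["MI", "SE", "EN", "SE"],
    "salary_in_usd": [79833, 260000, 105000, 150000],
})

# Default behaviour: numeric columns only.
numeric_summary = df.describe()

# include="all" summarises every column; statistics that do not
# apply to a column (e.g. the mean of a text column) show up as NaN.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```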


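In this filter, we're searching for rows where the salary in USD exceeds 101,000 (a threshold close to the median of 101,570 reported by `.describe()`) and the experience level is entry-level (`EN`) at the same time. Each condition is wrapped in parentheses and the two are combined with `&`.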

In [22]:
df_salaries[(df_salaries['salary_in_usd'] > 101000) & (df_salaries['experience_level'] == 'EN')]

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
37,37,2020,EN,FT,Machine Learning Engineer,250000,USD,250000,US,50,US,L
39,39,2020,EN,FT,Machine Learning Engineer,138000,USD,138000,US,100,US,S
68,68,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
115,115,2021,EN,FT,Machine Learning Scientist,225000,USD,225000,US,100,US,L
123,123,2021,EN,FT,Applied Data Scientist,80000,GBP,110037,GB,0,GB,L
159,159,2021,EN,FT,Machine Learning Engineer,125000,USD,125000,US,100,US,S
454,454,2022,EN,FT,Computer Vision Engineer,125000,USD,125000,US,0,US,M
465,465,2022,EN,FT,Data Engineer,120000,USD,120000,US,100,US,M
508,508,2022,EN,FT,Research Scientist,120000,USD,120000,US,100,US,L
510,510,2022,EN,FT,Computer Vision Software Engineer,150000,USD,150000,AU,100,AU,S
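
The threshold in the filter above is typed in by hand. A possible alternative, sketched here on hypothetical toy data, is to compute the cut-off with `.quantile()` so that it follows the data, and to apply the boolean mask through `.loc`:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the salaries dataset.
df = pd.DataFrame({
    "experience_level": ["EN", "EN", "SE", "MI", "EN"],
    "salary_in_usd": [250000, 60000, 150000, 90000, 120000],
})

# 3rd quartile (75th percentile) of the salary column.
q3 = df["salary_in_usd"].quantile(0.75)

# Each condition needs its own parentheses because `&` binds
# more tightly than the comparison operators.
mask = (df["salary_in_usd"] > q3) & (df["experience_level"] == "EN")
high_paid_entry = df.loc[mask]

print(q3)
print(high_paid_entry)
```

Swapping `0.75` for `0.5` would use the median as the cut-off instead.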


`.tail()` is a method that returns the last `n` rows of a DataFrame (5 by default):

In [23]:
df_salaries.tail(2)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
605,605,2022,SE,FT,Data Analyst,150000,USD,150000,US,100,US,M
606,606,2022,MI,FT,AI Scientist,200000,USD,200000,IN,100,US,L


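`.memory_usage()` returns the memory used by each column, in bytes. With `deep=True`, the actual contents of `object` columns are measured rather than only the pointers to them: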

In [24]:
df_salaries.memory_usage(deep=True)

Index                   128
Unnamed: 0             4856
work_year              4856
experience_level      35813
employment_type       35813
job_title             44468
salary                 4856
salary_currency       36420
salary_in_usd          4856
employee_residence    35813
remote_ratio           4856
company_location      35813
company_size          35206
dtype: int64
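
In the output above, the text columns take several times more memory than the integer ones. One common way to reduce that, assuming a column holds only a few distinct values, is to convert it to the `category` dtype. This is a minimal sketch on hypothetical toy data, not a measurement on the salaries file:

```python
import pandas as pd

# Hypothetical toy data: a low-cardinality text column repeated many times.
df = pd.DataFrame({
    "company_size": ["L", "S", "M", "L", "M"] * 200,
    "salary_in_usd": list(range(1000)),
})

before = df.memory_usage(deep=True)["company_size"]

# Categories store each distinct label once plus a small integer code per row.
df["company_size"] = df["company_size"].astype("category")

after = df.memory_usage(deep=True)["company_size"]

print(f"before: {before} bytes, after: {after} bytes")
```

The saving shrinks as the number of distinct values grows, so a column such as `job_title` would benefit less than `company_size`.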

`.value_counts()` returns how many times each distinct value appears in a column, sorted from most to least frequent. For instance, our dataset has a categorical column, `job_title`, whose values include *Data Scientist* and *Data Analyst*.

In [25]:
df_salaries['job_title'].value_counts()

job_title
Data Scientist                              143
Data Engineer                               132
Data Analyst                                 97
Machine Learning Engineer                    41
Research Scientist                           16
Data Science Manager                         12
Data Architect                               11
Big Data Engineer                             8
Machine Learning Scientist                    8
Principal Data Scientist                      7
AI Scientist                                  7
Data Science Consultant                       7
Director of Data Science                      7
Data Analytics Manager                        7
ML Engineer                                   6
Computer Vision Engineer                      6
BI Data Analyst                               6
Lead Data Engineer                            6
Data Engineering Manager                      5
Business Data Analyst                         5
Head of Data                  
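
Raw counts can be turned into proportions with the `normalize=True` argument. The following sketch uses a hypothetical toy Series rather than the real `job_title` column:

```python
import pandas as pd

# Hypothetical toy data standing in for the job_title column.
titles = pd.Series(
    ["Data Scientist", "Data Engineer", "Data Scientist", "Data Analyst"],
    name="job_title",
)

# Absolute frequency of each distinct value, most frequent first.
counts = titles.value_counts()

# Relative frequency: each count divided by the total number of rows.
shares = titles.value_counts(normalize=True)

print(counts)
print(shares)
```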

Let's imagine that we accidentally duplicated the rows of the original DataFrame and overwrote it without keeping a copy. There is a method for dropping duplicate rows, `.drop_duplicates()`. Quite easy to remember, isn't it?

In [44]:
df_salaries = pd.concat([df_salaries, df_salaries])
df_salaries.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1214 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          1214 non-null   int64 
 1   work_year           1214 non-null   int64 
 2   experience_level    1214 non-null   object
 3   employment_type     1214 non-null   object
 4   job_title           1214 non-null   object
 5   salary              1214 non-null   int64 
 6   salary_currency     1214 non-null   object
 7   salary_in_usd       1214 non-null   int64 
 8   employee_residence  1214 non-null   object
 9   remote_ratio        1214 non-null   int64 
 10  company_location    1214 non-null   object
 11  company_size        1214 non-null   object
dtypes: int64(5), object(7)
memory usage: 123.3+ KB


In [51]:
df_salaries = df_salaries.drop_duplicates()
df_salaries.info()

<class 'pandas.core.frame.DataFrame'>
Index: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 61.6+ KB
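
Note that after the concatenation the index is reported as `Index` rather than `RangeIndex`, because the original labels are kept and repeated. The sketch below, on hypothetical toy data, counts duplicates with `.duplicated()` before dropping them, rebuilds a clean index with `ignore_index=True`, and restricts the comparison to some columns with `subset`:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the salaries dataset.
df = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Analyst"],
    "salary_in_usd": [79833, 150000],
})

# Concatenating the frame with itself repeats the labels: 0, 1, 0, 1.
doubled = pd.concat([df, df])

# duplicated() flags every row that repeats an earlier one.
n_dupes = doubled.duplicated().sum()

# ignore_index=True relabels the result 0..n-1.
deduped = doubled.drop_duplicates(ignore_index=True)

# subset limits the comparison to the given columns;
# keep="last" retains the final occurrence instead of the first.
by_title = doubled.drop_duplicates(subset=["job_title"], keep="last")

print(n_dupes)
print(deduped)
print(by_title)
```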


The rows can be sorted by the values in a column with `sort_values('column', ascending=True)`; pass `ascending=False` to sort in descending order:

In [55]:
df_salaries.sort_values('salary_in_usd', ascending=True).head(5)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
176,176,2021,MI,FT,Data Scientist,58000,MXN,2859,MX,0,MX,S
185,185,2021,MI,FT,Data Engineer,4000,USD,4000,IR,100,IR,M
238,238,2021,EN,FT,Data Scientist,4000,USD,4000,VN,0,VN,M
77,77,2021,MI,PT,3D Computer Vision Researcher,400000,INR,5409,IN,50,IN,M
179,179,2021,MI,FT,Data Scientist,420000,INR,5679,IN,100,US,S


In [56]:
df_salaries.sort_values('salary_in_usd', ascending=False).head(5)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
252,252,2021,EX,FT,Principal Data Engineer,600000,USD,600000,US,100,US,L
97,97,2021,MI,FT,Financial Data Analyst,450000,USD,450000,US,100,US,L
33,33,2020,MI,FT,Research Scientist,450000,USD,450000,US,0,US,M
157,157,2021,MI,FT,Applied Machine Learning Scientist,423000,USD,423000,US,50,US,L
225,225,2021,EX,CT,Principal Data Scientist,416000,USD,416000,US,100,US,S
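
`sort_values()` also accepts a list of columns with a matching list of directions, and `nlargest()` is a shortcut for "sort descending, then take the first rows". A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the salaries dataset.
df = pd.DataFrame({
    "work_year": [2021, 2020, 2021, 2020],
    "salary_in_usd": [4000, 450000, 600000, 79833],
})

# Oldest year first; within each year, highest salary first.
by_year_then_pay = df.sort_values(
    ["work_year", "salary_in_usd"], ascending=[True, False]
)

# The two best-paid rows, without sorting the whole frame explicitly.
top_two = df.nlargest(2, "salary_in_usd")

print(by_year_then_pay)
print(top_two)
```

Both calls return new DataFrames and leave `df` unchanged.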
