# 1. Understanding the data

## Non-graphical approach


In [1]:
#Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns

#Load the data
df = pd.read_csv('./data/salaries.csv')


#View the data
df.head()

|  | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | SE | FT | Manager | 115000 | USD | 115000 | US | 0 | US | M |
| 1 | 2025 | SE | FT | Manager | 85000 | USD | 85000 | US | 0 | US | M |
| 2 | 2025 | SE | FT | Consultant | 171800 | USD | 171800 | US | 0 | US | M |
| 3 | 2025 | SE | FT | Consultant | 96600 | USD | 96600 | US | 0 | US | M |
| 4 | 2025 | MI | FT | Data Scientist | 145000 | USD | 145000 | US | 0 | US | M |


In [2]:
#Basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73306 entries, 0 to 73305
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           73306 non-null  int64 
 1   experience_level    73306 non-null  object
 2   employment_type     73306 non-null  object
 3   job_title           73306 non-null  object
 4   salary              73306 non-null  int64 
 5   salary_currency     73306 non-null  object
 6   salary_in_usd       73306 non-null  int64 
 7   employee_residence  73306 non-null  object
 8   remote_ratio        73306 non-null  int64 
 9   company_location    73306 non-null  object
 10  company_size        73306 non-null  object
dtypes: int64(4), object(7)
memory usage: 6.2+ MB


The dataset has 11 columns (variables): 7 are categorical (`object` dtype) and 4 are numeric (`int64`). Based on the non-null counts in this table, the dataset does not appear to contain any NaN values.
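
The split between categorical and numeric columns can also be counted programmatically instead of being read off the `df.info()` output. The sketch below is only an illustration: `sample` is a stand-in DataFrame built from the five rows shown by `df.head()`, and the names `numeric_cols` and `categorical_cols` are introduced here for the example. On the full data, `df` would be used in place of `sample`.

```python
import pandas as pd

# Stand-in for the real data: the five rows shown by df.head() above
sample = pd.DataFrame({
    "work_year": [2025, 2025, 2025, 2025, 2025],
    "experience_level": ["SE", "SE", "SE", "SE", "MI"],
    "employment_type": ["FT"] * 5,
    "job_title": ["Manager", "Manager", "Consultant", "Consultant", "Data Scientist"],
    "salary": [115000, 85000, 171800, 96600, 145000],
    "salary_currency": ["USD"] * 5,
    "salary_in_usd": [115000, 85000, 171800, 96600, 145000],
    "employee_residence": ["US"] * 5,
    "remote_ratio": [0] * 5,
    "company_location": ["US"] * 5,
    "company_size": ["M"] * 5,
})

# Split the column names by dtype family
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric columns:", numeric_cols)
print(len(categorical_cols), "categorical columns:", categorical_cols)
```

Keeping the two lists around is convenient later, for example to loop over the categorical columns when plotting.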

In [4]:
#Examine descriptive statistics
df.describe()

|  | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 73306.0 | 73306.0 | 73306.0 | 73306.0 |
| mean | 2023.833711 | 162596.9 | 157987.709151 | 21.560991 |
| std | 0.480106 | 192962.1 | 72504.604716 | 41.00808 |
| min | 2020.0 | 14000.0 | 15000.0 | 0.0 |
| 25% | 2024.0 | 106840.0 | 106800.0 | 0.0 |
| 50% | 2024.0 | 148000.0 | 147500.0 | 0.0 |
| 75% | 2024.0 | 200000.0 | 199700.0 | 0.0 |
| max | 2025.0 | 30400000.0 | 800000.0 | 100.0 |


This table covers only the 4 numeric variables, because `describe()` computes these statistics for numeric columns by default. The data spans the years 2020 to 2025. The `salary` column varies widely because it is reported in different currencies, but once normalized to USD, salaries in data professions range from 15,000 USD to 800,000 USD. The most common salary in USD and the most common remote ratio are obtained with the mode below.
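
The categorical columns can be summarized too, by asking `describe()` to exclude the numeric dtypes. The sketch below is a minimal illustration on a stand-in DataFrame (`sample`, built from three of the columns shown by `df.head()`); the name `categorical_summary` is introduced here for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data: three columns from the rows shown by df.head()
sample = pd.DataFrame({
    "experience_level": ["SE", "SE", "SE", "SE", "MI"],
    "job_title": ["Manager", "Manager", "Consultant", "Consultant", "Data Scientist"],
    "salary_in_usd": [115000, 85000, 171800, 96600, 145000],
})

# Excluding numeric dtypes leaves only the categorical columns, which are
# summarized by count, number of unique values, most frequent value and its frequency
categorical_summary = sample.describe(exclude=[np.number])
print(categorical_summary)
```

On the full data, the same call on `df` would report, for each categorical column, how many distinct values it has and which value dominates.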

In [None]:
#Obtaining modes 
print('Most common salary in Dollars (USD):', df['salary_in_usd'].mode())
print('Most common remote ratio:', df['remote_ratio'].mode())

Most common salary in Dollars (USD): 0    160000
Name: salary_in_usd, dtype: int64
Most common remote ratio: 0    0
Name: remote_ratio, dtype: int64


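The most common salary is 160,000 USD a year. The most common remote ratio is 0, and because the 75th percentile of `remote_ratio` is also 0, at least three quarters of the records have a remote ratio of 0, which indicates that remote data jobs are not common in this dataset.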
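
`Series.mode()` returns a Series because several values can tie for most frequent, which is why the printed output includes an index and a dtype line. The sketch below shows one way to get a plain value and the share of each category. The values in `remote_ratio` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values for illustration only (not from salaries.csv)
remote_ratio = pd.Series([0, 0, 0, 100, 0, 50, 0, 100], name="remote_ratio")

# mode() returns a Series; take the first entry to get a scalar
most_common = remote_ratio.mode().iloc[0]

# Relative frequency of each distinct value, most frequent first
shares = remote_ratio.value_counts(normalize=True)

print("Most common remote ratio:", most_common)
print(shares)
```

Looking at the shares rather than the mode alone shows how dominant the most common value actually is.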

In [5]:
#checking for missing values
df.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

There are no missing values.
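
When a dataset has many columns, the per-column listing can be condensed into a single total and a list of affected columns. The sketch below uses a small hypothetical DataFrame (`demo`) with missing values inserted on purpose, so that a non-zero result is visible; the names `total_missing` and `cols_with_missing` are introduced here for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing entries (not from salaries.csv)
demo = pd.DataFrame({
    "salary_in_usd": [115000, np.nan, 171800],
    "company_size": ["M", "M", None],
    "work_year": [2025, 2025, 2025],
})

# Total number of missing cells across the whole frame
total_missing = int(demo.isna().sum().sum())

# Names of the columns that contain at least one missing value
cols_with_missing = demo.columns[demo.isna().any()].tolist()

print("Total missing values:", total_missing)
print("Columns with missing values:", cols_with_missing)
```

For the salaries data, the per-column zeros above imply that the total would be 0 and the list would be empty.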

In [None]:
#Exploring values in categorical variables 
print('Experience level:', df['experience_level'].unique())
print('Employment type:', df['employment_type'].unique())
print('Job title:', df['job_title'].unique())
print('Company size:', df['company_size'].unique())


Experience level: ['SE' 'MI' 'EN' 'EX']
Employment type: ['FT' 'CT' 'PT' 'FL']
Job title: ['Manager' 'Consultant' 'Data Scientist' 'Software Engineer' 'Analyst'
 'Architect' 'Data Analyst' 'Associate' 'Data Management Specialist'
 'Data Governance' 'Data Engineer' 'Product Manager' 'Applied Scientist'
 'Software Development Engineer' 'Engineer' 'Research Scientist'
 'Machine Learning Engineer' 'Machine Learning Scientist'
 'Business Intelligence Developer' 'Data Architect' 'Power BI Developer'
 'Data Product Owner' 'AI Architect' 'AI Engineer' 'Research Engineer'
 'Data Manager' 'Quantitative Developer' 'Technical Lead'
 'Sales Development Representative' 'System Engineer' 'Analytics Engineer'
 'Solution Architect' 'Encounter Data Management Professional'
 'Data Infrastructure Engineer' 'Data Team Lead'
 'Business Intelligence Lead' 'DevOps Engineer' 'Decision Scientist'
 'Data Visualization Engineer' 'Data Governance Analyst'
 'Data Quality Analyst' 'Lead Analyst' 'Data Specialist'
 '
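
The two-letter codes in `experience_level` and `employment_type` are easier to read in tables and plots once mapped to labels. The sketch below shows one way to do that with `Series.map`. The label meanings are an assumption: they are not stated in this notebook and should be checked against the dataset's documentation before use. The names `EXPERIENCE_LABELS`, `EMPLOYMENT_LABELS`, `codes` and `readable` are introduced here for the example.

```python
import pandas as pd

# ASSUMPTION: these meanings are not given in the notebook; verify them
# against the dataset's documentation before relying on them
EXPERIENCE_LABELS = {"EN": "Entry-level", "MI": "Mid-level", "SE": "Senior", "EX": "Executive"}
EMPLOYMENT_LABELS = {"FT": "Full-time", "PT": "Part-time", "CT": "Contract", "FL": "Freelance"}

# The distinct codes printed by unique() above
codes = pd.DataFrame({
    "experience_level": ["SE", "MI", "EN", "EX"],
    "employment_type": ["FT", "CT", "PT", "FL"],
})

# map() replaces each code with its label; unknown codes would become NaN
readable = codes.assign(
    experience_level=codes["experience_level"].map(EXPERIENCE_LABELS),
    employment_type=codes["employment_type"].map(EMPLOYMENT_LABELS),
)
print(readable)
```

Because unmapped codes turn into NaN, a quick `readable.isna().sum()` after mapping would reveal any code missing from the dictionaries.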

In [None]:
#Exploring values in categorical variables
print('Salary currency:', df['salary_currency'].unique())
print('Employee residence:', df['employee_residence'].unique())
print('Company location:', df['company_location'].unique())

Salary currency: ['USD' 'INR' 'GBP' 'EUR' 'PHP' 'CAD' 'SGD' 'BRL' 'PLN' 'CHF' 'AUD' 'JPY'
 'DKK' 'CZK' 'HUF' 'MXN' 'ILS' 'TRY' 'ZAR' 'SEK' 'NZD' 'NOK' 'HKD' 'THB'
 'CLP']
Employee residence: ['US' 'CA' 'GB' 'AU' 'IN' 'DE' 'LT' 'SK' 'FR' 'AT' 'PH' 'AM' 'SG' 'LU'
 'BR' 'NL' 'IT' 'CO' 'CL' 'PL' 'CY' 'ES' 'RW' 'NZ' 'CH' 'LV' 'IL' 'CZ'
 'IE' 'JP' 'PE' 'KR' 'ZA' 'EG' 'PR' 'LB' 'GR' 'AR' 'FI' 'MX' 'DK' 'NG'
 'BE' 'BG' 'EC' 'SV' 'CR' 'HU' 'PT' 'HR' 'KE' 'SE' 'UA' 'TR' 'PK' 'HN'
 'MT' 'RO' 'VE' 'BM' 'VN' 'RS' 'GE' 'AE' 'SA' 'OM' 'BA' 'EE' 'UG' 'SI'
 'MU' 'TH' 'QA' 'RU' 'TN' 'GH' 'AD' 'MD' 'NO' 'UZ' 'HK' 'CF' 'KW' 'IR'
 'AS' 'CN' 'BO' 'DO' 'ID' 'MY' 'DZ' 'IQ' 'JE']
Company location: ['US' 'CA' 'GB' 'AU' 'IN' 'DE' 'LT' 'SK' 'FR' 'AT' 'PH' 'AM' 'SG' 'LU'
 'BR' 'NL' 'IT' 'CO' 'CL' 'PL' 'CY' 'ES' 'CD' 'NZ' 'CH' 'LV' 'IL' 'CZ'
 'IE' 'JP' 'PE' 'KR' 'ZA' 'EG' 'PR' 'LB' 'GR' 'AR' 'FI' 'MX' 'DK' 'NG'
 'BE' 'BG' 'EC' 'SV' 'CR' 'HU' 'PT' 'HR' 'KE' 'SE' 'UA' 'TR' 'PK' 'HN'
 'MT' 'RO' 'VE' 'DZ' 'AS' 'RS' 'AE
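
For columns with many distinct values, such as `job_title` or the country columns, printing `unique()` produces long arrays that are hard to scan. Counting the distinct values and listing only the most frequent ones is often more informative. The sketch below runs on a small hypothetical DataFrame (`sample`) that reuses a few country codes from the output above; the row values themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration (country codes reused from the output above)
sample = pd.DataFrame({
    "employee_residence": ["US", "US", "CA", "GB", "US", "CA"],
    "company_location": ["US", "US", "CA", "GB", "US", "US"],
})

# Number of distinct values per column
distinct_counts = sample.nunique()

# The most frequent residences, most frequent first
top_residences = sample["employee_residence"].value_counts().head(3)

print(distinct_counts)
print(top_residences)
```

On the full data, `df[categorical_cols].nunique()` style summaries give the cardinality of every categorical column at a glance, which helps decide which ones are practical to plot directly.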

## Graphical approach
This part visualizes the categorical variables to better understand the data.