# DATA JOBS SALARY ANALYSIS

### Content: 

1. work_year: The year the salary was paid.

2. experience_level: The experience level in the job during the year.

3. employment_type: The type of employment for the role.

4. job_title: The role worked in during the year.

5. salary: The total gross salary amount paid.

6. salary_currency: The currency of the salary paid, as an ISO 4217 currency code.

7. salary_in_usd: The salary in USD.

8. employee_residence: The employee's primary country of residence during the work year, as an ISO 3166 country code.

9. remote_ratio: The overall amount of work done remotely (0, 50, or 100 in this dataset).

10. company_location: The country of the employer's main office or contracting branch.

11. company_size: The median number of people that worked for the company during the year.


#### The 'experience_level' column has 4 categorical values:

- EN, which refers to Entry-level / Junior.
- MI, which refers to Mid-level / Intermediate.
- SE, which refers to Senior-level / Expert.
- EX, which refers to Executive-level / Director.
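
The four codes above can be expanded into readable labels before plotting or grouping. The following is a minimal sketch, assuming a small hypothetical `sample` frame in place of the full dataset; the `EXPERIENCE_LABELS` dictionary and the `experience_label` column name are illustrative and not part of the original data.

```python
import pandas as pd

# Code-to-label mapping taken from the list of experience levels above.
EXPERIENCE_LABELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}

# Hypothetical sample rows; in the notebook this would be the loaded DataFrame.
sample = pd.DataFrame({"experience_level": ["SE", "MI", "EN", "EX"]})

# `experience_label` is an illustrative column name, not a dataset column.
sample["experience_label"] = sample["experience_level"].map(EXPERIENCE_LABELS)
print(sample)
```

`Series.map` returns `NaN` for any code missing from the dictionary, so an unexpected code would show up as a missing label rather than raising an error.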

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('ds_salaries.csv')

In [3]:
df.head()

|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |


In [4]:
df.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [7]:
df.shape

(3755, 11)

In [13]:
df['company_location'].unique()

array(['ES', 'US', 'CA', 'DE', 'GB', 'NG', 'IN', 'HK', 'NL', 'CH', 'CF',
       'FR', 'FI', 'UA', 'IE', 'IL', 'GH', 'CO', 'SG', 'AU', 'SE', 'SI',
       'MX', 'BR', 'PT', 'RU', 'TH', 'HR', 'VN', 'EE', 'AM', 'BA', 'KE',
       'GR', 'MK', 'LV', 'RO', 'PK', 'IT', 'MA', 'PL', 'AL', 'AR', 'LT',
       'AS', 'CR', 'IR', 'BS', 'HU', 'AT', 'SK', 'CZ', 'TR', 'PR', 'DK',
       'BO', 'PH', 'BE', 'ID', 'EG', 'AE', 'LU', 'MY', 'HN', 'JP', 'DZ',
       'IQ', 'CN', 'NZ', 'CL', 'MD', 'MT'], dtype=object)

In [14]:
df['work_year'].unique()

array([2023, 2022, 2020, 2021], dtype=int64)

In [15]:
df['salary_currency'].unique()

array(['EUR', 'USD', 'INR', 'HKD', 'CHF', 'GBP', 'AUD', 'SGD', 'CAD',
       'ILS', 'BRL', 'THB', 'PLN', 'HUF', 'CZK', 'DKK', 'JPY', 'MXN',
       'TRY', 'CLP'], dtype=object)

In [19]:
# Collect the distinct values of every column (one array per column)
unique_values = df.apply(lambda x: x.unique())

print(unique_values)

work_year                                      [2023, 2022, 2020, 2021]
experience_level                                       [SE, MI, EN, EX]
employment_type                                        [FT, CT, FL, PT]
job_title             [Principal Data Scientist, ML Engineer, Data S...
salary                [80000, 30000, 25500, 175000, 120000, 222200, ...
salary_currency       [EUR, USD, INR, HKD, CHF, GBP, AUD, SGD, CAD, ...
salary_in_usd         [85847, 30000, 25500, 175000, 120000, 222200, ...
employee_residence    [ES, US, CA, DE, GB, NG, IN, HK, PT, NL, CH, C...
remote_ratio                                               [100, 0, 50]
company_location      [ES, US, CA, DE, GB, NG, IN, HK, NL, CH, CF, F...
company_size                                                  [L, S, M]
dtype: object
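
When only the number of distinct values matters, `DataFrame.nunique()` gives a more compact summary than listing the values themselves. This is a minimal sketch on a small hypothetical `sample` frame; the rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that reuse three of the dataset's column names.
sample = pd.DataFrame({
    "work_year": [2023, 2023, 2022],
    "experience_level": ["SE", "MI", "SE"],
    "remote_ratio": [100, 0, 100],
})

# Number of distinct values per column, returned as a Series indexed by column name.
unique_counts = sample.nunique()
print(unique_counts)
```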


In [18]:
# Drop fully duplicated rows, keeping the first occurrence of each
unique_combinations = df.drop_duplicates()

# Display the de-duplicated rows
print(unique_combinations)

      work_year experience_level employment_type                 job_title  \
0          2023               SE              FT  Principal Data Scientist   
1          2023               MI              CT               ML Engineer   
2          2023               MI              CT               ML Engineer   
3          2023               SE              FT            Data Scientist   
4          2023               SE              FT            Data Scientist   
...         ...              ...             ...                       ...   
3750       2020               SE              FT            Data Scientist   
3751       2021               MI              FT  Principal Data Scientist   
3752       2020               EN              FT            Data Scientist   
3753       2020               EN              CT     Business Data Analyst   
3754       2021               SE              FT      Data Science Manager   

       salary salary_currency  salary_in_usd employee_residence
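
Printing the de-duplicated frame does not show how many rows were removed. A minimal sketch of counting them, assuming a small hypothetical `sample` frame (the rows are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates of each other.
sample = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "ML Engineer"],
    "salary_in_usd": [120000, 120000, 30000],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = sample.drop_duplicates()

print(n_duplicates, len(deduplicated))
```

Comparing `len(df)` with `len(df.drop_duplicates())` on the real data would give the same information.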

In [20]:
df.describe()

|   | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 3755.0 | 3755.0 | 3755.0 | 3755.0 |
| mean | 2022.373635 | 190695.6 | 137570.38988 | 46.271638 |
| std | 0.691448 | 671676.5 | 63055.625278 | 48.58905 |
| min | 2020.0 | 6000.0 | 5132.0 | 0.0 |
| 25% | 2022.0 | 100000.0 | 95000.0 | 0.0 |
| 50% | 2022.0 | 138000.0 | 135000.0 | 0.0 |
| 75% | 2023.0 | 180000.0 | 175000.0 | 100.0 |
| max | 2023.0 | 30400000.0 | 450000.0 | 100.0 |
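
In the summary above, `salary` reaches 30,400,000 while `salary_in_usd` tops out at 450,000, which is consistent with `salary` mixing the currencies listed earlier. For comparisons across rows, `salary_in_usd` is therefore likely the safer column. The sketch below groups it by experience level on a small hypothetical `sample` frame; the figures are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical rows; values are illustrative only.
sample = pd.DataFrame({
    "experience_level": ["EN", "EN", "SE", "SE", "SE"],
    "salary_in_usd": [50000, 70000, 120000, 150000, 180000],
})

# Median USD salary per experience level; the median is less sensitive
# to extreme values than the mean.
median_by_level = sample.groupby("experience_level")["salary_in_usd"].median()
print(median_by_level)
```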
