# AKINGBENI - Wrangling and EDA of data science salaries.

In [2]:
#Import libraries

#data wrangling libraries
import pandas as pd
import numpy as np

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#import misc libraries
import warnings

In [3]:
#Read in the dataset
filepath = "./data/ds_salaries (1).csv"
df = pd.read_csv(filepath)
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


## Information and introduction about dataset

> Data science was described in 2012 as the sexiest job of the 21st century by [Harvard Business Review](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). At the heart of data science are three fields: traditional statistics, computer science and computational power, and business domain knowledge. The blend of these three fields in different proportions and perspectives has given rise to varying forms of data scientists.

> Some people, myself included, argue that roles like data engineer fit under the term data scientist, perhaps as a full-stack data scientist. Popular roles under the data science umbrella are Data Analyst, Data Scientist, Machine Learning Scientist/Researcher, Machine Learning Engineer, and more specific roles like data visualization expert.

> However, while passion counts for a great deal in whatever one does, subjects like salaries and opportunities should not be overlooked when searching for a career path. I personally struggled with these. I still have my reservations, but I will focus my story on the data in front of us.


### Introduction about dataset
> The dataset contains salary information for over 600 people across different job titles and experience levels. It covers the years 2020 to 2022, to capture data science roles and salaries after the onset of COVID-19.

**Features**
>- work_year: The year the salary was paid.
>- experience_level: The experience level in the job during the year with the following possible values: `EN ---> Entry-level / Junior`, `MI ---> Mid-level / Intermediate`, `SE ---> Senior-level / Expert`, `EX ---> Executive-level / Director`
>- employment_type: The type of employment for the role: `PT ---> Part-time`, `FT ---> Full-time`, `CT ---> Contract`, `FL ---> Freelance`
>- job_title: The role worked in during the year.
>- salary: The total gross salary amount paid.
>- salary_currency: The currency of the salary paid as an ISO 4217 currency code.
>- salary_in_usd: The salary in USD (FX rate divided by avg. USD rate for the respective year via [Fx](fxdata.foorilla.com)).
>- employee_residence: Employee's primary country of residence during the work year as an ISO 3166 country code.
>- remote_ratio: The overall amount of work done remotely, possible values are as follows: 0 = No remote work (less than 20%), 50 = Partially remote, 100 = Fully remote (more than 80%)
>- company_location: The country of the employer's main office or contracting branch as an ISO 3166 country code.
>- company_size: The average number of people that worked for the company during the year: `S ---> less than 50 employees (small)`, `M ---> 50 to 250 employees (medium)`, `L ---> more than 250 employees (large)`

[Direct info about dataset on kaggle](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries)
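
The coded values in the feature list above can be turned into readable labels with plain dictionaries. The sketch below is illustrative only: the dictionary names and the small `sample` frame are assumptions introduced here (its rows are copied from the first three rows of the `df.head()` output), and the labels are shortened versions of the descriptions above.

```python
import pandas as pd

# Code-to-label mappings, shortened from the feature descriptions above.
# The dictionary names are hypothetical, not part of the dataset.
EXPERIENCE_LABELS = {
    "EN": "Entry-level",
    "MI": "Mid-level",
    "SE": "Senior-level",
    "EX": "Executive-level",
}
EMPLOYMENT_LABELS = {
    "PT": "Part-time",
    "FT": "Full-time",
    "CT": "Contract",
    "FL": "Freelance",
}
COMPANY_SIZE_LABELS = {"S": "Small", "M": "Medium", "L": "Large"}
REMOTE_LABELS = {0: "No remote work", 50: "Partially remote", 100: "Fully remote"}

# Stand-in for df: the coded columns of the first three rows shown earlier.
sample = pd.DataFrame(
    {
        "experience_level": ["MI", "SE", "SE"],
        "employment_type": ["FT", "FT", "FT"],
        "remote_ratio": [0, 0, 50],
        "company_size": ["L", "S", "M"],
    }
)

# Series.map looks each code up in the dictionary; unknown codes become NaN.
decoded = sample.assign(
    experience_level=sample["experience_level"].map(EXPERIENCE_LABELS),
    employment_type=sample["employment_type"].map(EMPLOYMENT_LABELS),
    remote_ratio=sample["remote_ratio"].map(REMOTE_LABELS),
    company_size=sample["company_size"].map(COMPANY_SIZE_LABELS),
)
print(decoded)
```

Because `map` returns `NaN` for any code missing from the dictionary, a quick `decoded.isna().sum()` would reveal unexpected codes in the real data.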

## Data Wrangling

In [5]:
#Check general head information
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [17]:
#Visual Assessment
df.sample(6)

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
39,39,2020,EN,FT,Machine Learning Engineer,138000,USD,138000,US,100,US,S
23,23,2020,MI,FT,BI Data Analyst,98000,USD,98000,US,0,US,M
582,582,2022,SE,FT,Data Engineer,220110,USD,220110,US,100,US,M
92,92,2021,MI,FT,Lead Data Analyst,1450000,INR,19609,IN,100,IN,L
84,84,2021,EX,FT,Director of Data Science,130000,EUR,153667,IT,100,PL,L
135,135,2021,MI,FT,Data Analyst,90000,USD,90000,US,100,US,M


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          607 non-null    int64 
 1   work_year           607 non-null    int64 
 2   experience_level    607 non-null    object
 3   employment_type     607 non-null    object
 4   job_title           607 non-null    object
 5   salary              607 non-null    int64 
 6   salary_currency     607 non-null    object
 7   salary_in_usd       607 non-null    int64 
 8   employee_residence  607 non-null    object
 9   remote_ratio        607 non-null    int64 
 10  company_location    607 non-null    object
 11  company_size        607 non-null    object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
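
The `Unnamed: 0` column in the summary above is what pandas typically produces when a CSV was written with its index and then read back without declaring it. The following is a minimal sketch of that behaviour, assuming a tiny in-memory CSV (`csv_text` is made up for illustration, using two values from the first rows) rather than the real file.

```python
import io

import pandas as pd

# A tiny CSV whose first header cell is empty, as happens when a frame
# is saved with its index. The content is a hypothetical stand-in.
csv_text = ",work_year,salary_in_usd\n0,2020,79833\n1,2020,260000\n"

# Default read: the nameless first column is kept and labelled "Unnamed: 0".
with_unnamed = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column.
without_unnamed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_unnamed.columns))
print(list(without_unnamed.columns))
```

The same `index_col=0` argument could be passed when reading the salaries file, or the column could be dropped afterwards; either choice leaves the remaining columns unchanged.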


In [8]:
#Check the unique values in the work_year column
df.work_year.value_counts()

2022    318
2021    217
2020     72
Name: work_year, dtype: int64

In [9]:
#Check unique values of experience_level are as expected

df.experience_level.unique()

array(['MI', 'SE', 'EN', 'EX'], dtype=object)

In [11]:
#Check unique values of employment types
df.employment_type.unique()

array(['FT', 'CT', 'PT', 'FL'], dtype=object)

In [18]:
df.job_title.value_counts()

Data Scientist                              143
Data Engineer                               132
Data Analyst                                 97
Machine Learning Engineer                    41
Research Scientist                           16
Data Science Manager                         12
Data Architect                               11
Big Data Engineer                             8
Machine Learning Scientist                    8
Principal Data Scientist                      7
AI Scientist                                  7
Data Science Consultant                       7
Director of Data Science                      7
Data Analytics Manager                        7
ML Engineer                                   6
Computer Vision Engineer                      6
BI Data Analyst                               6
Lead Data Engineer                            6
Data Engineering Manager                      5
Business Data Analyst                         5
Head of Data                            

In [20]:
df.employee_residence.unique()

array(['DE', 'JP', 'GB', 'HN', 'US', 'HU', 'NZ', 'FR', 'IN', 'PK', 'PL',
       'PT', 'CN', 'GR', 'AE', 'NL', 'MX', 'CA', 'AT', 'NG', 'PH', 'ES',
       'DK', 'RU', 'IT', 'HR', 'BG', 'SG', 'BR', 'IQ', 'VN', 'BE', 'UA',
       'MT', 'CL', 'RO', 'IR', 'CO', 'MD', 'KE', 'SI', 'HK', 'TR', 'RS',
       'PR', 'LU', 'JE', 'CZ', 'AR', 'DZ', 'TN', 'MY', 'EE', 'AU', 'BO',
       'IE', 'CH'], dtype=object)

### Data Assessments

- The `Unnamed: 0` column is irrelevant; it is a product of not dropping the index when the file was saved
- The work_year column should be a datetime datatype
- The shorthand codes in experience_level are ambiguous
- experience_level should be an ordered category type
- Redundant columns ---> salary and salary_currency
- The employment_type column contains ambiguous values
- The job_title column contains varying titles for the same job
- employee_residence and company_location are ambiguous
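
Several of the assessments above translate directly into pandas operations. The sketch below shows one possible way to address four of them; it is not the notebook's final cleaning code. The function name `clean_salaries` and the `sample` frame are assumptions introduced here, with rows copied from the outputs shown earlier (index 0, 39, 84 and 582).

```python
import pandas as pd

# Stand-in for df: a subset of columns from rows 0, 39, 84 and 582.
sample = pd.DataFrame(
    {
        "Unnamed: 0": [0, 39, 84, 582],
        "work_year": [2020, 2020, 2021, 2022],
        "experience_level": ["MI", "EN", "EX", "SE"],
        "salary": [70000, 138000, 130000, 220110],
        "salary_currency": ["EUR", "USD", "EUR", "USD"],
        "salary_in_usd": [79833, 138000, 153667, 220110],
    }
)

# Seniority order, lowest to highest, from the feature descriptions.
EXPERIENCE_ORDER = ["EN", "MI", "SE", "EX"]


def clean_salaries(frame):
    """Return a cleaned copy of frame; the input is left untouched."""
    # Drop the saved index and the columns made redundant by salary_in_usd.
    cleaned = frame.drop(columns=["Unnamed: 0", "salary", "salary_currency"])
    # Parse the year as a date (1 January of that year).
    cleaned["work_year"] = pd.to_datetime(
        cleaned["work_year"].astype(str), format="%Y"
    )
    # Make experience level an ordered category so it sorts by seniority.
    cleaned["experience_level"] = pd.Categorical(
        cleaned["experience_level"], categories=EXPERIENCE_ORDER, ordered=True
    )
    return cleaned


cleaned = clean_salaries(sample)
print(cleaned.dtypes)
print(cleaned.sort_values("experience_level"))
```

With the ordered category in place, sorting and comparisons such as `cleaned["experience_level"] >= "SE"` follow seniority rather than alphabetical order.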

In [None]:
#Define function that uses str.contains to replace certain job types.
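
The cell above is still empty, so here is one way such a function might look. This is a sketch under stated assumptions: the rule list `TITLE_RULES`, the group labels, and the function name `group_job_titles` are all hypothetical choices, and the sample titles are taken from the `job_title` counts shown earlier.

```python
import pandas as pd

# Hypothetical (pattern, label) rules; the first matching rule wins.
TITLE_RULES = [
    (r"machine learning|\bml\b", "Machine Learning"),
    (r"analyst|analytics", "Data Analysis"),
    (r"data engineer", "Data Engineering"),
    (r"scientist|data science", "Data Science"),
]


def group_job_titles(titles, rules=TITLE_RULES):
    """Replace each title with the label of the first rule it matches.

    Titles that match no rule are returned unchanged.
    """
    grouped = titles.copy()
    unmatched = pd.Series(True, index=titles.index)
    for pattern, label in rules:
        # Case-insensitive regex search, restricted to still-unmatched rows.
        hits = unmatched & titles.str.contains(pattern, case=False, regex=True)
        grouped[hits] = label
        unmatched &= ~hits
    return grouped


# Titles copied from the value_counts output above.
titles = pd.Series(
    [
        "Data Scientist",
        "Machine Learning Scientist",
        "Big Data Engineer",
        "ML Engineer",
        "BI Data Analyst",
        "Lead Data Engineer",
        "Head of Data",
        "Director of Data Science",
    ]
)
grouped = group_job_titles(titles)
print(pd.DataFrame({"job_title": titles, "group": grouped}))
```

Rule order matters here: "Machine Learning Scientist" lands in the machine learning group only because that rule is checked before the "scientist" rule. Titles such as "Head of Data" match nothing and are left as they are, which makes leftovers easy to spot.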

## Conclusion


## Limitations

- The time span is very limited: the data contains no years before 2020, so a proper comparison between pre-COVID and post-COVID times is not possible.