### 🖋 **Notebook Contents**

0. Initial Setup
1. Business Problem Understanding
2. Data Understanding
3. Data Preparation

****

## `Initial Setup`

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import missingno as msno

import warnings
warnings.filterwarnings("ignore")

## `Business Problem`

## `Data Understanding`

The dataset can be accessed through this link: [dataset](https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023)

| Column | Definition |
| ------ | ---------- |
| `work_year` | The year the salary was paid. |
| `experience_level` | The experience level in the job during the year. |
| `employment_type` | The type of employment for the role. |
| `job_title` | The role worked in during the year. |
| `salary` | The total gross salary amount paid. |
| `salary_currency` | The currency of the salary paid, as an ISO 4217 currency code. |
| `salary_in_usd` | The salary in USD. |
| `employee_residence` | Employee's primary country of residence during the work year, as an ISO 3166 country code. |
| `remote_ratio` | The overall amount of work done remotely. |
| `company_location` | The country of the employer's main office or contracting branch. |
| `company_size` | The median number of people that worked for the company during the year. |
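
The column definitions above can also serve as a lightweight schema check before any analysis. The sketch below is only an illustration: `check_columns` is a hypothetical helper, not part of the original notebook, and the `toy` frame (including its `bonus` column) is made up to show what a mismatch looks like.

```python
import pandas as pd

# Column names copied from the definition table above.
EXPECTED_COLUMNS = [
    "work_year", "experience_level", "employment_type", "job_title",
    "salary", "salary_currency", "salary_in_usd", "employee_residence",
    "remote_ratio", "company_location", "company_size",
]


def check_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> dict:
    """Hypothetical helper: list expected columns that are absent and columns that are not expected."""
    actual = list(df.columns)
    return {
        "missing": [col for col in expected if col not in actual],
        "unexpected": [col for col in actual if col not in expected],
    }


# Made-up frame for illustration only; "bonus" is not a real column of the dataset.
toy = pd.DataFrame({"work_year": [2023], "job_title": ["Data Engineer"], "bonus": [0]})
report = check_columns(toy)
print(report)
```

Running the same helper on the loaded `data` frame should return two empty lists if the file matches the table.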

In [3]:
data = pd.read_csv("../data/raw/ds_salaries.csv")
data.sample(10)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
3538,2021,MI,CT,ML Engineer,270000,USD,270000,US,100,US,L
2564,2022,SE,FT,Data Engineer,105700,USD,105700,US,0,US,L
470,2023,MI,FT,Data Engineer,75000,USD,75000,US,0,US,M
510,2023,SE,FT,Data Analyst,125600,USD,125600,US,0,US,M
2420,2022,MI,FT,Data Engineer,90000,USD,90000,US,100,US,M
1619,2023,MI,FT,Data Engineer,95000,USD,95000,US,0,US,M
1873,2022,EN,FT,Data Science Consultant,23000,EUR,24165,IT,50,IT,M
2219,2022,SE,FT,Analytics Engineer,110000,USD,110000,US,0,US,M
843,2023,SE,FT,Data Scientist,186300,USD,186300,US,100,US,M
792,2023,SE,FT,Data Engineer,129000,USD,129000,US,0,US,M
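
The sample shows that `salary` and `salary_in_usd` coincide for USD rows and differ otherwise (for example, 23000 EUR is recorded as 24165 USD). A minimal sketch of how the implied conversion rate could be inspected, using two rows copied from the output above; the column name `implied_usd_rate` is an assumption introduced here for illustration:

```python
import pandas as pd

# Two rows copied from the sample output above (subset of columns).
rows = pd.DataFrame(
    {
        "job_title": ["Data Engineer", "Data Science Consultant"],
        "salary": [105700, 23000],
        "salary_currency": ["USD", "EUR"],
        "salary_in_usd": [105700, 24165],
    }
)

# Implied conversion rate: USD amount per unit of the original currency.
rows["implied_usd_rate"] = rows["salary_in_usd"] / rows["salary"]

# For USD rows the rate is expected to be exactly 1.
usd_rows = rows["salary_currency"] == "USD"
usd_consistent = (rows.loc[usd_rows, "implied_usd_rate"] == 1.0).all()

print(rows[["salary_currency", "implied_usd_rate"]])
print("USD rows consistent:", usd_consistent)
```

Because of this, `salary_in_usd` is the comparable column across countries, while `salary` is only meaningful together with `salary_currency`.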


In [7]:
# Per-column overview: dtype, share of nulls, cardinality, value checks, range, and sample values
pd.DataFrame(
    {
        'columns': data.columns.values,
        'data_type': data.dtypes.values,
        'null_value(%)': data.isna().mean().values * 100,
        'n_unique': data.nunique().values,
        'zero_value': [(data[col] == 0).any() for col in data.columns],
        # Only numeric columns are checked; is_numeric_dtype does not depend on the platform's default int width
        'neg_value': [pd.api.types.is_numeric_dtype(data[col]) and (data[col] < 0).any() for col in data.columns],
        'min': data.min().values,
        'max': data.max().values,
        'sample_unique': [data[col].unique() for col in data.columns]
    }
)

Unnamed: 0,columns,data_type,null_value(%),n_unique,zero_value,neg_value,min,max,sample_unique
0,work_year,int64,0.0,4,False,False,2020,2023,"[2023, 2022, 2020, 2021]"
1,experience_level,object,0.0,4,False,False,EN,SE,"[SE, MI, EN, EX]"
2,employment_type,object,0.0,4,False,False,CT,PT,"[FT, CT, FL, PT]"
3,job_title,object,0.0,93,False,False,3D Computer Vision Researcher,Staff Data Scientist,"[Principal Data Scientist, ML Engineer, Data S..."
4,salary,int64,0.0,815,False,False,6000,30400000,"[80000, 30000, 25500, 175000, 120000, 222200, ..."
5,salary_currency,object,0.0,20,False,False,AUD,USD,"[EUR, USD, INR, HKD, CHF, GBP, AUD, SGD, CAD, ..."
6,salary_in_usd,int64,0.0,1035,False,False,5132,450000,"[85847, 30000, 25500, 175000, 120000, 222200, ..."
7,employee_residence,object,0.0,78,False,False,AE,VN,"[ES, US, CA, DE, GB, NG, IN, HK, PT, NL, CH, C..."
8,remote_ratio,int64,0.0,3,True,False,0,100,"[100, 0, 50]"
9,company_location,object,0.0,72,False,False,AE,VN,"[ES, US, CA, DE, GB, NG, IN, HK, NL, CH, CF, F..."


**_Insight_**:
- This dataset contains 3755 rows and 11 columns.
- Variable data types:
    - Numerical
        - Discrete: -
        - Continuous: salary, salary_in_usd
    - Categorical
        - Nominal: employment_type, job_title, salary_currency, employee_residence, company_location
        - Ordinal: experience_level, remote_ratio, company_size
    - Datetime
        - work_year
- There are no missing values.
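
Since `experience_level`, `remote_ratio`, and `company_size` are treated as ordinal, one way to make that order explicit is an ordered pandas categorical. The sketch below uses a few rows copied from the sample output. The orderings themselves are assumptions: `EN < MI < SE < EX` assumes the codes run from entry level to executive, and `S` is assumed to be a valid `company_size` value even though only `M` and `L` appear in the sample.

```python
import pandas as pd

# Rows copied from the data.sample(10) output shown earlier (subset of columns).
sample = pd.DataFrame(
    {
        "experience_level": ["MI", "SE", "MI", "SE", "EN"],
        "remote_ratio": [100, 0, 0, 0, 50],
        "company_size": ["L", "L", "M", "M", "M"],
    }
)

# Assumed orderings, lowest to highest (see the note above).
ORDERINGS = {
    "experience_level": ["EN", "MI", "SE", "EX"],
    "remote_ratio": [0, 50, 100],
    "company_size": ["S", "M", "L"],
}

for col, order in ORDERINGS.items():
    sample[col] = pd.Categorical(sample[col], categories=order, ordered=True)

# Ordered categoricals support min/max, comparisons, and sorting by rank.
print(sample["experience_level"].min(), sample["experience_level"].max())
print((sample["company_size"] >= "M").all())
print(sample.sort_values("experience_level"))
```

With the order encoded, sorting and comparisons follow the rank of each level rather than alphabetical order, which would otherwise place `EX` before `MI`.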