# JOBS IN DATA

This script cleans the jobs_in_data dataset (https://www.kaggle.com/datasets/hummaamqaasim/jobs-in-data) and prepares the data by:
- adding a column with the salary in GBP
- dropping the original salary and salary_currency columns, which will not be used

In [1]:
# Import dependencies
import pandas as pd
import numpy as np

In [2]:
# import the data
data = pd.read_csv("jobs_in_data.csv")

# check the dataframe
data.head(10)

Unnamed: 0,work_year,job_title,job_category,salary_currency,salary,salary_in_usd,employee_residence,experience_level,employment_type,work_setting,company_location,company_size
0,2023,Data DevOps Engineer,Data Engineering,EUR,88000,95012,Germany,Mid-level,Full-time,Hybrid,Germany,L
1,2023,Data Architect,Data Architecture and Modeling,USD,186000,186000,United States,Senior,Full-time,In-person,United States,M
2,2023,Data Architect,Data Architecture and Modeling,USD,81800,81800,United States,Senior,Full-time,In-person,United States,M
3,2023,Data Scientist,Data Science and Research,USD,212000,212000,United States,Senior,Full-time,In-person,United States,M
4,2023,Data Scientist,Data Science and Research,USD,93300,93300,United States,Senior,Full-time,In-person,United States,M
5,2023,Data Scientist,Data Science and Research,USD,130000,130000,United States,Senior,Full-time,Remote,United States,M
6,2023,Data Scientist,Data Science and Research,USD,100000,100000,United States,Senior,Full-time,Remote,United States,M
7,2023,Machine Learning Researcher,Machine Learning and AI,USD,224400,224400,United States,Mid-level,Full-time,In-person,United States,M
8,2023,Machine Learning Researcher,Machine Learning and AI,USD,138700,138700,United States,Mid-level,Full-time,In-person,United States,M
9,2023,Data Engineer,Data Engineering,USD,210000,210000,United States,Executive,Full-time,Remote,United States,M


In [3]:
# check data types to see if we need to change any
data.dtypes

work_year              int64
job_title             object
job_category          object
salary_currency       object
salary                 int64
salary_in_usd          int64
employee_residence    object
experience_level      object
employment_type       object
work_setting          object
company_location      object
company_size          object
dtype: object

In [4]:
# Check if there are any NAs
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9355 entries, 0 to 9354
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           9355 non-null   int64 
 1   job_title           9355 non-null   object
 2   job_category        9355 non-null   object
 3   salary_currency     9355 non-null   object
 4   salary              9355 non-null   int64 
 5   salary_in_usd       9355 non-null   int64 
 6   employee_residence  9355 non-null   object
 7   experience_level    9355 non-null   object
 8   employment_type     9355 non-null   object
 9   work_setting        9355 non-null   object
 10  company_location    9355 non-null   object
 11  company_size        9355 non-null   object
dtypes: int64(3), object(9)
memory usage: 877.2+ KB
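
`data.info()` reports non-null counts per column, which then have to be compared against the row count by eye. A more direct check is to count missing values per column. The sketch below uses a small made-up frame (`sample` and its values are hypothetical, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame for illustration only (not rows from jobs_in_data.csv)
sample = pd.DataFrame({
    "job_title": ["Data Engineer", None, "Data Scientist"],
    "salary": [100000, 90000, np.nan],
})

# Count missing values per column; all zeros would mean there are no NAs
na_counts = sample.isna().sum()
print(na_counts)

# Single yes/no answer for the whole frame
has_na = bool(sample.isna().any().any())
print(has_na)
```

Running the same two expressions on `data` would give the per-column NA counts for the real dataset.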


After inspecting the data, we identified no duplicates, so we do not run any code to delete them.

If you did want to remove duplicates, you could run `data.drop_duplicates(inplace=True, ignore_index=True)` and then call `data.info()` again to confirm the new row count.
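
One way to confirm whether a frame contains duplicate rows before deciding to drop them is `DataFrame.duplicated()`. This is a minimal sketch on a made-up frame (`sample` and its values are hypothetical), not the inspection the authors actually performed:

```python
import pandas as pd

# Hypothetical sample frame; the second and third rows are identical
sample = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Scientist", "Data Scientist"],
    "salary": [100000, 90000, 90000],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)

# Non-mutating removal: returns a new frame and renumbers the index
deduplicated = sample.drop_duplicates(ignore_index=True)
print(len(deduplicated))
```

Using the non-mutating form keeps the original frame available for comparison, which can be convenient while exploring.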

In [5]:
# We want salaries in GBP, so we create a new column with every salary converted to GBP
# First, check which currencies appear in the data
data.salary_currency.unique()

array(['EUR', 'USD', 'GBP', 'CAD', 'AUD', 'PLN', 'BRL', 'TRY', 'CHF',
       'SGD', 'DKK'], dtype=object)

In [6]:
# now convert salaries to GBP

salary_in_gbp = []

for i in range(len(data)):
   
    currency = data['salary_currency'][i]
    salary = data['salary'][i]
    
    if currency == "EUR":
        converted_salary = salary*0.856806
    elif currency == "USD":
        converted_salary = salary*0.797301
    elif currency == "CAD":
        converted_salary = salary*0.589401
    elif currency == "AUD":
        converted_salary = salary*0.517266
    elif currency == "PLN":
        converted_salary = salary*0.197459
    elif currency == "BRL":
        converted_salary = salary*0.160023
    elif currency == "TRY":
        converted_salary = salary*0.0260931
    elif currency == "CHF":
        converted_salary = salary*0.916607
    elif currency == "SGD":
        converted_salary = salary*0.592088
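    elif currency == "DKK":
        converted_salary = salary*0.114866
    elif currency == "GBP":
        # already in GBP, so no conversion is needed
        converted_salary = salary
    else:
        # fail loudly rather than reuse the previous row's converted value
        raise ValueError(f"No exchange rate for currency: {currency}")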
    
    salary_in_gbp.append(converted_salary)
    
# add that column to the data
data['salary_in_gbp'] = salary_in_gbp
data['salary_in_gbp'] = data['salary_in_gbp'].astype('int64')

# drop salary columns we will not use
data.drop(["salary_currency", "salary"], axis = 1, inplace = True)
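
The row-by-row conversion can also be written as a vectorized lookup. The sketch below is an alternative, not the notebook's own method: the rates are copied from the loop above, the `GBP` entry is an added assumption (a GBP salary needs no conversion), and `RATES_TO_GBP` and `sample` are hypothetical names with made-up rows.

```python
import pandas as pd

# Exchange rates to GBP copied from the loop above; the GBP entry is an
# addition, on the assumption that GBP salaries need no conversion
RATES_TO_GBP = {
    "EUR": 0.856806, "USD": 0.797301, "GBP": 1.0, "CAD": 0.589401,
    "AUD": 0.517266, "PLN": 0.197459, "BRL": 0.160023, "TRY": 0.0260931,
    "CHF": 0.916607, "SGD": 0.592088, "DKK": 0.114866,
}

# Hypothetical sample rows for illustration only
sample = pd.DataFrame({
    "salary_currency": ["EUR", "USD", "GBP"],
    "salary": [88000, 186000, 50000],
})

# Look up each row's rate; a currency missing from the dict becomes NaN
rate = sample["salary_currency"].map(RATES_TO_GBP)

# Fail loudly on unmapped currencies instead of carrying a stale value
if rate.isna().any():
    missing = sample.loc[rate.isna(), "salary_currency"].unique()
    raise ValueError(f"No exchange rate for: {list(missing)}")

# astype('int64') drops the fractional part, as in the loop version
sample["salary_in_gbp"] = (sample["salary"] * rate).astype("int64")
print(sample)
```

Keeping the rates in one dictionary makes it easier to see which currencies are covered, and the explicit check means a currency without a rate raises an error rather than producing a wrong salary.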

In [7]:
cleaned_df = data
cleaned_df.to_csv("cleaned_data.csv", index = False)

# Create tables for SQLite

In [8]:
# Create job_category table

