In [1]:
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt
#importing libraries

dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()
# loading data

df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])
#data cleanup

In this section, we apply functions to specific columns to transform their values. Note that in the job_skills column, each list of skills is stored as a string rather than as an actual list. To clean this up, we can use the apply method.

**Note: whenever we modify a filtered subset of the original dataframe, we first create a copy of it with .copy() so that we can avoid the 'SettingWithCopyWarning'.**
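
As a minimal sketch of the note above, the toy dataframe here (made-up values, not the real dataset) is filtered and copied before a column is added. The column name has_salary is hypothetical and only used for illustration. Depending on the pandas version, the warning itself may or may not appear without .copy(), but copying keeps the subset independent either way.

```python
import pandas as pd

# Toy data standing in for the jobs dataframe (values are made up).
toy = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Scientist"],
    "salary_year_avg": [90000.0, None, 120000.0],
})

# Filter first, then copy, so the subset is an independent dataframe.
toy_salary = toy[toy["salary_year_avg"].notna()].copy()

# Adding a column to the copy leaves the original dataframe untouched.
toy_salary["has_salary"] = True

print(toy_salary)
print(toy.columns.tolist())
```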


In [14]:
df_salary = df[pd.notna(df['salary_year_avg'])].copy()
df_salary.head()
# here, we use pd.notna as the condition to filter out all the NA values
# and only return a dataframe with actual numerical values in the salary_year_avg column

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
28,Data Scientist,CRM Data Specialist,"San José Province, San José, Costa Rica",via Ai-Jobs.net,Full-time,False,Costa Rica,2023-08-01 13:37:57,False,False,Costa Rica,year,109500.0,,Netskope,"['gdpr', 'excel']","{'analyst_tools': ['excel'], 'libraries': ['gd..."
77,Data Engineer,Data Engineer,"Arlington, VA",via LinkedIn,Full-time,False,Sudan,2023-06-26 14:22:54,False,False,Sudan,year,140000.0,,Intelletec,"['mongodb', 'mongodb', 'python', 'r', 'sql', '...","{'analyst_tools': ['tableau'], 'cloud': ['orac..."
92,Data Engineer,Remote - Data Engineer - Permanent - W2,Anywhere,via LinkedIn,Full-time,True,"Illinois, United States",2023-02-21 13:29:59,False,True,United States,year,120000.0,,Apex Systems,"['sql', 'python']","{'programming': ['sql', 'python']}"
100,Data Scientist,"Data Scientist, Risk Data Mining - USDS","Mountain View, CA",via LinkedIn,Full-time,False,"California, United States",2023-07-31 13:01:18,False,True,United States,year,228222.0,,TikTok,"['sql', 'r', 'python', 'express']","{'programming': ['sql', 'r', 'python'], 'webfr..."
109,Data Analyst,Senior Supply Chain Analytics Analyst,Anywhere,via Get.It,Full-time,True,"Illinois, United States",2023-10-12 13:02:19,False,True,United States,year,89000.0,,Get It Recruit - Transportation,"['python', 'r', 'alteryx', 'tableau']","{'analyst_tools': ['alteryx', 'tableau'], 'pro..."
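
The same filter can also be expressed with dropna and its subset parameter. The sketch below uses a small made-up dataframe to compare the two approaches; it is only meant to show that both should keep the same rows.

```python
import pandas as pd

# Toy data (made up) with one missing salary.
toy = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Scientist"],
    "salary_year_avg": [90000.0, None, 120000.0],
})

# Boolean-mask approach, as used in the cell above.
by_mask = toy[pd.notna(toy["salary_year_avg"])]

# Alternative: drop rows where salary_year_avg is missing.
by_dropna = toy.dropna(subset=["salary_year_avg"])

# Both approaches should select the same rows.
print(by_mask.equals(by_dropna))
```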


Now that we have filtered out the rows with missing salaries, we want to calculate the projected salary for next year. Here, we multiply each salary by 1.03 to account for an inflation rate of 3.0%. To do this, we can use the method:

- **df.apply()**

which takes parameters including func and axis.

- func: function to apply to each column or row

We can quickly pull up the documentation with the following code:

In [6]:
help(df.apply)

Help on method apply in module pandas.core.frame:

apply(func: 'AggFuncType', axis: 'Axis' = 0, raw: 'bool' = False, result_type: "Literal['expand', 'reduce', 'broadcast'] | None" = None, args=(), by_row: "Literal[False, 'compat']" = 'compat', engine: "Literal['python', 'numba']" = 'python', engine_kwargs: 'dict[str, bool] | None' = None, **kwargs) method of pandas.core.frame.DataFrame instance
    Apply a function along an axis of the DataFrame.
    
    Objects passed to the function are Series objects whose index is
    either the DataFrame's index (``axis=0``) or the DataFrame's columns
    (``axis=1``). By default (``result_type=None``), the final return type
    is inferred from the return type of the applied function. Otherwise,
    it depends on the `result_type` argument.
    
    Parameters
    ----------
    func : function
        Function to apply to each column or row.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis along which the function is applied:
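
To make the axis parameter from the help text more concrete, here is a small sketch on a made-up dataframe. With axis=0 the function receives each column as a Series, and with axis=1 it receives each row.

```python
import pandas as pd

# Toy dataframe (made up) with two numeric columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function is called once per column.
per_column = toy.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function is called once per row.
per_row = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(per_column)
print(per_row)
```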
 

The first thing we have to do is create a function to apply to a given column:

In [15]:
def projected_salary(salary):
    return salary * 1.03


df_salary['salary_year_inflated'] = df_salary['salary_year_avg'].apply(projected_salary)
df_salary[['salary_year_avg', 'salary_year_inflated']]
# this returns new values (salary_year_inflated) because the function has been applied to every value
# in the salary_year_avg column. Also, because we pass the function itself rather than call it, we do not add ()

Unnamed: 0,salary_year_avg,salary_year_inflated
28,109500.0,112785.00
77,140000.0,144200.00
92,120000.0,123600.00
100,228222.0,235068.66
109,89000.0,91670.00
...,...,...
785624,139216.0,143392.48
785641,150000.0,154500.00
785648,221875.0,228531.25
785682,157500.0,162225.00
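
If the rate needs to vary, extra arguments can be forwarded to the function through apply. The sketch below assumes a hypothetical variant of the function, projected_salary_with_rate, with a rate parameter. It is not part of the original notebook, and the salaries are made up.

```python
import pandas as pd

# Hypothetical variant of projected_salary with a configurable rate.
def projected_salary_with_rate(salary, rate=1.03):
    return salary * rate

salaries = pd.Series([100000.0, 150000.0], name="salary_year_avg")

# Default rate of 1.03.
default_rate = salaries.apply(projected_salary_with_rate)

# Extra positional arguments go in the args tuple...
via_args = salaries.apply(projected_salary_with_rate, args=(1.05,))

# ...and extra keyword arguments are forwarded to the function as well.
via_kwargs = salaries.apply(projected_salary_with_rate, rate=1.05)

print(default_rate)
print(via_args)
```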


Given that the projected_salary function is short, we can rewrite it as a lambda (anonymous) function:

In [18]:
df_salary['salary_year_inflated'] = df_salary['salary_year_avg'].apply(lambda salary: salary * 1.03)
df_salary[['salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,salary_year_avg,salary_year_inflated
28,109500.0,112785.00
77,140000.0,144200.00
92,120000.0,123600.00
100,228222.0,235068.66
109,89000.0,91670.00
...,...,...
785624,139216.0,143392.48
785641,150000.0,154500.00
785648,221875.0,228531.25
785682,157500.0,162225.00


Because the function is very simple, we could even simplify the code to:

df_salary['salary_year_inflated'] = df_salary['salary_year_avg'] * 1.03

But the examples above were just an introduction to the apply() method when it is run on a column.

### Converting 'job_skills' column to list datatype

Now recall that we need to convert the values in the 'job_skills' column of the original dataframe from strings to lists.

In [23]:
type(df['job_skills'][1])
# returns the type of the job_skills value in the row at index 1

str

If we recall the lesson on datatypes, we could use the list() function, but this would split the string into a list of individual characters.

Instead, we are going to import the 'ast' module (which stands for Abstract Syntax Trees) and use the ast.literal_eval(node_or_string) function, which evaluates the string as a Python literal and returns the corresponding container datatype, here a list.
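
The difference between the two approaches can be sketched on a single made-up string that has the same shape as the values in job_skills:

```python
import ast

# A string that looks like a list, similar to the job_skills values.
raw = "['sql', 'python']"

# list() iterates over the string, so every character becomes an element.
as_chars = list(raw)

# ast.literal_eval parses the string as a Python literal instead.
as_list = ast.literal_eval(raw)

print(as_chars[:5])
print(as_list)
```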

In [26]:
import ast

type(ast.literal_eval(df['job_skills'][1]))

# Note: if I try to update the whole column at once with df['job_skills'] = ast.literal_eval(df['job_skills']),
# this will not work and raises a ValueError, because literal_eval expects a single string rather than a Series.
# Therefore, we need to use the apply method in this case

list
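
The behaviour described in the note can be reproduced on a small made-up Series. Passing the whole Series to ast.literal_eval is expected to raise a ValueError, whereas apply hands the function one string at a time.

```python
import ast
import pandas as pd

# Toy Series of list-like strings (made up).
skills = pd.Series(["['sql', 'python']", "['excel']"])

# Passing the whole Series fails, since literal_eval expects one string.
try:
    ast.literal_eval(skills)
    whole_series_ok = True
except ValueError:
    whole_series_ok = False

# apply calls literal_eval on each element separately.
parsed = skills.apply(ast.literal_eval)

print(whole_series_ok)
print(parsed.tolist())
```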

We can now use this inside our own function. Note that because this column contains NA values, calling ast.literal_eval on them would raise errors. Therefore, we add the check:

- if pd.notna(skill_list):

so that NA values are skipped (the function returns None for them). Also, when we pass clean_up to apply, we do not have to supply the 'skill_list' argument ourselves, as apply passes each value of the 'job_skills' column to the function in turn.

In [50]:
def clean_up(skill_list):
    if pd.notna(skill_list):
        return ast.literal_eval(skill_list)

df['job_skills'] = df['job_skills'].apply(clean_up)

In [61]:
type(df['job_skills'][1])

list

We have now converted the data in this column into lists. We can write the same logic as a lambda:

In [59]:
df['job_skills'] = df['job_skills'].apply(lambda skill_list: ast.literal_eval(skill_list) if pd.notna(skill_list) else skill_list) 

# Note that this only works while the values are still strings. Once the column has been converted to lists,
# re-running this code no longer works
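
The note above points out that the conversion only works on string input. One possible way around this, sketched here with a hypothetical helper named parse_skills and made-up data, is to check for strings explicitly so that the cell can be re-run safely. This is an alternative to the pd.notna check, not the approach used in this notebook.

```python
import ast
import pandas as pd

# Hypothetical helper: parse only strings, pass everything else through.
def parse_skills(value):
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value  # already a list, or a missing value

# Toy Series (made up) with one list-like string and one missing value.
skills = pd.Series(["['sql', 'python']", None], dtype="object")

once = skills.apply(parse_skills)
twice = once.apply(parse_skills)  # second run leaves the lists unchanged

print(once.tolist())
print(twice.tolist())
```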

### Calculate Projected Salary Next Year (apply() to a row) based on condition

Previously, we applied the .apply() method to a column, but what if we want to apply a function to each row? Earlier we applied 3% inflation to every single salary, but let's say senior roles are instead projected to receive a 5% increase. How can we use this method to account for this condition?


In [81]:
def projected_sal(row):
    if 'Senior' in row['job_title_short']:
        return row['salary_year_avg'] * 1.05

In [82]:
df_salary['salary_year_inflated'] = df_salary.apply(projected_sal, axis=1)
df_salary[['job_title_short', 'salary_year_avg', 'salary_year_inflated']]

Unnamed: 0,job_title_short,salary_year_avg,salary_year_inflated
28,Data Scientist,109500.0,
77,Data Engineer,140000.0,
92,Data Engineer,120000.0,
100,Data Scientist,228222.0,
109,Data Analyst,89000.0,
...,...,...,...
785624,Data Engineer,139216.0,
785641,Data Engineer,150000.0,
785648,Data Scientist,221875.0,
785682,Data Scientist,157500.0,
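
The empty values above appear because projected_sal has no return statement for rows without 'Senior' in the title. If the intent is to keep the earlier 3% for all other roles, a variant could look like the sketch below. The function name projected_sal_all and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical variant: 5% for senior roles, 3% for everyone else.
def projected_sal_all(row):
    if "Senior" in row["job_title_short"]:
        return row["salary_year_avg"] * 1.05
    return row["salary_year_avg"] * 1.03

# Toy data (made up) with one senior and one non-senior role.
toy = pd.DataFrame({
    "job_title_short": ["Senior Data Engineer", "Data Scientist"],
    "salary_year_avg": [160000.0, 100000.0],
})

# axis=1 passes each row to the function as a Series.
toy["salary_year_inflated"] = toy.apply(projected_sal_all, axis=1)

print(toy)
```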


To check, we first filter for rows whose job_title_short contains 'Senior', then look those rows up by index to see whether there is a corresponding value in the 'salary_year_inflated' column.

In [87]:
df_salary[df_salary['job_title_short'].str.contains('Senior', case=False, na=False)].head()

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills,salary_year_inflated
495,Senior Data Engineer,Senior Software Engineer (Data),"San Francisco, CA",via LinkedIn,Full-time,False,"Illinois, United States",2023-06-20 13:07:37,False,True,United States,year,168500.0,,hackajob,"['scala', 'gcp', 'azure', 'spark', 'kafka', 'h...","{'cloud': ['gcp', 'azure'], 'libraries': ['spa...",176925.0
573,Senior Data Engineer,Senior Python Data Engineer,"Wilmington, DE",via Indeed,Full-time,False,Sudan,2023-09-16 13:13:50,False,False,Sudan,year,160000.0,,Crackajack Solutions,"['python', 'sql', 'java', 'aws', 'databricks',...","{'cloud': ['aws', 'databricks', 'redshift'], '...",168000.0
657,Senior Data Engineer,Senior Data Engineer | Series D Video Analytic...,"Culver City, CA",via LinkedIn,Full-time,False,Georgia,2023-10-09 14:07:46,False,True,United States,year,165000.0,,Coda Search│Staffing,"['python', 'scala', 'sql', 'aws', 'redshift', ...","{'cloud': ['aws', 'redshift'], 'libraries': ['...",173250.0
726,Senior Data Engineer,Senior Data Engineer (Hybrid),"Washington, DC",via Linux Careers,Full-time,False,"California, United States",2023-05-02 13:09:09,False,True,United States,year,173500.0,,Capital One,"['java', 'scala', 'python', 'nosql', 'sql', 's...","{'cloud': ['redshift', 'snowflake', 'aws', 'az...",182175.0
733,Senior Data Engineer,Senior Data Engineer,"Oakland, CA",via LinkedIn,Full-time,False,Sudan,2023-07-06 13:41:35,False,False,Sudan,year,160000.0,,X4 Life Sciences,"['python', 'sql', 'postgresql', 'sql server', ...","{'cloud': ['aws', 'snowflake'], 'databases': [...",168000.0


In [93]:
df_salary.loc[[495, 573, 657, 726, 733, 28]]

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills,salary_year_inflated
495,Senior Data Engineer,Senior Software Engineer (Data),"San Francisco, CA",via LinkedIn,Full-time,False,"Illinois, United States",2023-06-20 13:07:37,False,True,United States,year,168500.0,,hackajob,"['scala', 'gcp', 'azure', 'spark', 'kafka', 'h...","{'cloud': ['gcp', 'azure'], 'libraries': ['spa...",176925.0
573,Senior Data Engineer,Senior Python Data Engineer,"Wilmington, DE",via Indeed,Full-time,False,Sudan,2023-09-16 13:13:50,False,False,Sudan,year,160000.0,,Crackajack Solutions,"['python', 'sql', 'java', 'aws', 'databricks',...","{'cloud': ['aws', 'databricks', 'redshift'], '...",168000.0
657,Senior Data Engineer,Senior Data Engineer | Series D Video Analytic...,"Culver City, CA",via LinkedIn,Full-time,False,Georgia,2023-10-09 14:07:46,False,True,United States,year,165000.0,,Coda Search│Staffing,"['python', 'scala', 'sql', 'aws', 'redshift', ...","{'cloud': ['aws', 'redshift'], 'libraries': ['...",173250.0
726,Senior Data Engineer,Senior Data Engineer (Hybrid),"Washington, DC",via Linux Careers,Full-time,False,"California, United States",2023-05-02 13:09:09,False,True,United States,year,173500.0,,Capital One,"['java', 'scala', 'python', 'nosql', 'sql', 's...","{'cloud': ['redshift', 'snowflake', 'aws', 'az...",182175.0
733,Senior Data Engineer,Senior Data Engineer,"Oakland, CA",via LinkedIn,Full-time,False,Sudan,2023-07-06 13:41:35,False,False,Sudan,year,160000.0,,X4 Life Sciences,"['python', 'sql', 'postgresql', 'sql server', ...","{'cloud': ['aws', 'snowflake'], 'databases': [...",168000.0
28,Data Scientist,CRM Data Specialist,"San José Province, San José, Costa Rica",via Ai-Jobs.net,Full-time,False,Costa Rica,2023-08-01 13:37:57,False,False,Costa Rica,year,109500.0,,Netskope,"['gdpr', 'excel']","{'analyst_tools': ['excel'], 'libraries': ['gd...",


As you can see, all the rows with 'Senior' in the title have a corresponding value in the 'salary_year_inflated' column, whilst the Data Scientist role (index 28) does not, because projected_sal returns nothing (None, shown as an empty value) for non-senior rows. Therefore, applying the function to the rows has worked.
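
Rather than inspecting a handful of rows by index, the same check could be done for every row with a boolean mask. The sketch below rebuilds the situation on made-up data, using a function with the same logic as projected_sal. The names is_senior, senior_filled and others_empty are hypothetical.

```python
import pandas as pd

# Same logic as projected_sal: only senior rows get a value.
def projected_sal(row):
    if "Senior" in row["job_title_short"]:
        return row["salary_year_avg"] * 1.05

# Toy data (made up) mixing senior and non-senior roles.
toy = pd.DataFrame({
    "job_title_short": ["Senior Data Engineer", "Data Scientist", "Senior Data Analyst"],
    "salary_year_avg": [160000.0, 109500.0, 110000.0],
})
toy["salary_year_inflated"] = toy.apply(projected_sal, axis=1)

# Case-sensitive mask, matching the 'in' check inside the function.
is_senior = toy["job_title_short"].str.contains("Senior", na=False)

# Every senior row should have a value, and every other row should not.
senior_filled = toy.loc[is_senior, "salary_year_inflated"].notna().all()
others_empty = toy.loc[~is_senior, "salary_year_inflated"].isna().all()

print(senior_filled, others_empty)
```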