## Salary Prediction

### Table of Contents

1. [Problem Statement](#Problem-Statement)
2. [Libraries](#Libraries)
3. [Data Loading](#Data-Loading)
4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
   - [Dataset Summary](#Dataset-Summary)
   - [Data Quality Assessment](#Data-Quality-Assessment)
        - [Missing Values](#Missing-Values)
        - [Duplicates](#Duplicates)
   - [Dataset Distribution](#Dataset-Distribution)
5. [Data Preprocessing](#Data-Preprocessing)
6. [Model Building, Training and Hyperparameter Tuning](#Model-Building-and-Training)
     1. [Model 1: Random Forest](#Model-1:-Random-Forest)
     2. [Model 2: Logistic Regression](#Model-2:-Logistic-Regression)
7. [Model Evaluation](#Model-Evaluation)
8. [Results and Insights](#Results-and-insights)
9. [Future Work](#Future-Work)
10. [Conclusion](#Conclusion)


#### Problem Statement

#### Libraries

In [76]:
import pandas as pd
import plotly.express as px
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

#### Data Loading

In [56]:
# load the dataset
salary_df = pd.read_csv("data/salary_data.csv")

# display the dataframe
salary_df

Unnamed: 0.1,Unnamed: 0,Job Title,Location,Size,Type of ownership,Industry,Sector,hourly,employer_provided,min_salary,max_salary,avg_salary,job_state,python_yn,R_yn,spark,aws,excel,job_simp,seniority
0,0,Data Scientist,"Albuquerque, NM",501 to 1000 employees,Company - Private,Aerospace & Defense,Aerospace & Defense,0,0,53,91,72.0,NM,1,0,0,0,1,data scientist,na
1,1,Healthcare Data Scientist,"Linthicum, MD",10000+ employees,Other Organization,Health Care Services & Hospitals,Health Care,0,0,63,112,87.5,MD,1,0,0,0,0,data scientist,na
2,2,Data Scientist,"Clearwater, FL",501 to 1000 employees,Company - Private,Security Services,Business Services,0,0,80,90,85.0,FL,1,0,1,0,1,data scientist,na
3,3,Data Scientist,"Richland, WA",1001 to 5000 employees,Government,Energy,"Oil, Gas, Energy & Utilities",0,0,56,97,76.5,WA,1,0,0,0,0,data scientist,na
4,4,Data Scientist,"New York, NY",51 to 200 employees,Company - Private,Advertising & Marketing,Business Services,0,0,86,143,114.5,NY,1,0,0,0,1,data scientist,na
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
737,737,"Sr Scientist, Immuno-Oncology - Oncology","Cambridge, MA",10000+ employees,Company - Public,Biotech & Pharmaceuticals,Biotech & Pharmaceuticals,0,0,58,111,84.5,MA,0,0,0,1,0,na,senior
738,738,Senior Data Engineer,"Nashville, TN",1001 to 5000 employees,Company - Public,Internet,Information Technology,0,0,72,133,102.5,TN,1,0,1,1,0,data engineer,senior
739,739,"Project Scientist - Auton Lab, Robotics Institute","Pittsburgh, PA",501 to 1000 employees,College / University,Colleges & Universities,Education,0,0,56,91,73.5,PA,0,0,0,0,1,na,na
740,740,Data Science Manager,"Allentown, PA",1 to 50 employees,Company - Private,Staffing & Outsourcing,Business Services,0,0,95,160,127.5,PA,0,0,0,0,1,manager,na


#### Exploratory Data Analysis

##### Dataset Summary

In [57]:
# display the dataset's shape
print(f"Dataset's shape: {salary_df.shape}\n")

# display the summary of the dataset
print("Dataset Description:")
print(salary_df.describe())

# display the information - column data types - of the dataset
print("\nDataset info:")
print(salary_df.info())

# display the number of categorical and numerical columns
num_categorical = salary_df.select_dtypes(include='object').shape[1]
num_numerical = salary_df.select_dtypes(include=['int64', 'float64']).shape[1]
print(f"\nNumber of categorical columns: {num_categorical}")
print(f"Number of numerical columns: {num_numerical}")


Dataset's shape: (742, 20)

Dataset Description:
       Unnamed: 0      hourly  ...         aws       excel
count  742.000000  742.000000  ...  742.000000  742.000000
mean   370.500000    0.032345  ...    0.237197    0.522911
std    214.341239    0.177034  ...    0.425651    0.499812
min      0.000000    0.000000  ...    0.000000    0.000000
25%    185.250000    0.000000  ...    0.000000    0.000000
50%    370.500000    0.000000  ...    0.000000    1.000000
75%    555.750000    0.000000  ...    0.000000    1.000000
max    741.000000    1.000000  ...    1.000000    1.000000

[8 rows x 11 columns]

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 742 entries, 0 to 741
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         742 non-null    int64  
 1   Job Title          742 non-null    object 
 2   Location           742 non-null    object 
 3   Size               742 non-null 

**Observations from the results:**
1. The dataset contains 742 rows and 20 columns, with the first column serving as the unique identifier for each row.
2. There are 9 categorical columns and 11 numerical columns.
3. The target variable is the `avg_salary` column, which is a numerical column.
4. Some columns are typed as numeric but are really boolean flags stored as 0/1: `hourly`, `employer_provided`, `python_yn`, `R_yn`, `spark`, `aws`, and `excel`. The last five indicate whether the skill is required or not. `job_simp` and `seniority` are categorical text columns, not flags.
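
The 0/1 flag columns can also be found programmatically instead of by reading the output. A minimal sketch on made-up rows that stand in for `salary_df`; `find_binary_columns` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def find_binary_columns(dataframe: pd.DataFrame) -> list:
    """Return the numeric columns whose values are all 0 or 1."""
    numeric = dataframe.select_dtypes(include="number")
    return [
        column
        for column in numeric.columns
        if numeric[column].dropna().isin([0, 1]).all()
    ]


# made-up sample rows that mimic the layout of the salary data
sample_df = pd.DataFrame({
    "avg_salary": [72.0, 87.5, 85.0],
    "python_yn": [1, 1, 0],
    "spark": [0, 0, 1],
    "seniority": ["na", "senior", "na"],
})

binary_columns = find_binary_columns(sample_df)
print(binary_columns)
```

Text columns such as `seniority` are skipped because only numeric dtypes are inspected.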

##### Data Quality Assessment

In [58]:
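# check for missing values
missing_values = salary_df.isna().sum()
print("Dataset Missing values count:")
print(missing_values)

# check for duplicates
# note: the unique 'Unnamed: 0' identifier is part of the comparison, so rows that differ only by it are not reported as duplicates
duplicates = salary_df[salary_df.duplicated()].count()
print("Dataset Duplicates count:")
print(duplicates)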

Dataset Missing values count:
Unnamed: 0           0
Job Title            0
Location             0
Size                 0
Type of ownership    0
Industry             0
Sector               0
hourly               0
employer_provided    0
min_salary           0
max_salary           0
avg_salary           0
job_state            0
python_yn            0
R_yn                 0
spark                0
aws                  0
excel                0
job_simp             0
seniority            0
dtype: int64
Dataset Duplicates count:
Unnamed: 0           0
Job Title            0
Location             0
Size                 0
Type of ownership    0
Industry             0
Sector               0
hourly               0
employer_provided    0
min_salary           0
max_salary           0
avg_salary           0
job_state            0
python_yn            0
R_yn                 0
spark                0
aws                  0
excel                0
job_simp             0
seniority            0
dtype: int64
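
Because the first column is a unique row identifier, `duplicated()` on the full frame cannot flag two rows as identical. A minimal sketch on made-up rows, assuming the identifier column is named `Unnamed: 0` as in the output above:

```python
import pandas as pd

# made-up rows: the last two are identical apart from the row identifier
sample_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Job Title": ["Data Scientist", "Data Engineer", "Data Engineer"],
    "avg_salary": [72.0, 102.5, 102.5],
})

# with the identifier included, every row looks unique
duplicates_with_id = int(sample_df.duplicated().sum())

# ignoring the identifier exposes the repeated posting
feature_columns = list(sample_df.columns.drop("Unnamed: 0"))
duplicates_without_id = int(sample_df.duplicated(subset=feature_columns).sum())

print(duplicates_with_id, duplicates_without_id)
```

Whether the real dataset contains such repeated postings is not shown here; the sketch only illustrates why the identifier should be left out of the comparison.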

##### Dataset Distribution

In [59]:
# function to display the unique values in categorical and low-cardinality columns
def check_unique_values(dataframe: pd.DataFrame) -> None:
    """Print the unique values of object columns and of columns with fewer than 10 unique values.

        :Input:
            dataframe: a pandas DataFrame

        :Output:
            None; one line is printed per qualifying column

        Example of a printed line:
            Unique values in 'spark' column: 0, 1
        """

    # iterate over each column in the DataFrame
    for column in dataframe.columns:
        unique_values = dataframe[column].unique()
        if dataframe[column].dtype == 'object' or len(unique_values) < 10:
            unique_values_str = [str(value) for value in unique_values]
            print(
                f"Unique values in '{column}' column: {', '.join(unique_values_str)}")


check_unique_values(salary_df)

# This helps identify the categorical and boolean columns

Unique values in 'Job Title' column: Data Scientist, Healthcare Data Scientist, Research Scientist, Staff Data Scientist - Technology, Data Analyst, Data Engineer I, Scientist I/II, Biology, Customer Data Scientist, Data Scientist - Health Data Analytics, Senior Data Scientist / Machine Learning, Data Scientist - Quantitative, Digital Health Data Scientist, Associate Data Analyst, Clinical Data Scientist, Data Scientist / Machine Learning Expert, Web Data Analyst, Senior Data Scientist, Data Engineer, Data Scientist - Algorithms & Inference, Scientist, Lead Data Scientist, Spectral Scientist/Engineer, College Hire - Data Scientist - Open to December 2019 Graduates, Data Scientist, Office of Data Science, Data Science Analyst, Senior Risk Data Scientist, Data Scientist in Artificial Intelligence Early Career, Data Scientist - Research, R&D Data Analysis Scientist, Analytics Consultant, Director, Data Science, Data Scientist SR, R&D Sr Data Scientist, Customer Data Scientist/Sales Engine

In [60]:
# set the boolean columns
boolean_columns = ['hourly', 'employer_provided', 'python_yn', 'R_yn', 'spark', 'aws', 'excel']

In [61]:
# display a correlation heatmap for the numerical columns
numerical_columns = salary_df.select_dtypes(include=['int64', 'float64']).columns

# exclude the boolean columns and the row identifier, which carries no information
numerical_columns = numerical_columns.difference(boolean_columns).difference(['Unnamed: 0'])

correlation_matrix = salary_df[numerical_columns].corr()

fig = px.imshow(correlation_matrix, text_auto=True, width=800, height=600)
fig.show()

# alternative using matplotlib and seaborn (needs `import matplotlib.pyplot as plt` and `import seaborn as sns`)
# plt.figure(figsize=(12, 8))
# sns.heatmap(correlation_matrix)
# plt.show()


In [62]:
# check the value counts for seniority
salary_df['seniority'].value_counts()

seniority
na        520
senior    220
jr          2
Name: count, dtype: int64

In [63]:
# check the value counts for the job_simp
salary_df['job_simp'].value_counts()

job_simp
data scientist    279
na                184
data engineer     119
analyst           102
manager            22
mle                22
director           14
Name: count, dtype: int64

In [64]:
# count the rows where both seniority and job_simp are 'na'
salary_df[(salary_df['seniority'] == 'na') & (salary_df['job_simp'] == 'na')].count()

Unnamed: 0           127
Job Title            127
Location             127
Size                 127
Type of ownership    127
Industry             127
Sector               127
hourly               127
employer_provided    127
min_salary           127
max_salary           127
avg_salary           127
job_state            127
python_yn            127
R_yn                 127
spark                127
aws                  127
excel                127
job_simp             127
seniority            127
dtype: int64
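
The earlier `isna()` check reports no missing values, yet `seniority` and `job_simp` use the string `'na'` as a placeholder. A minimal sketch on made-up rows that turns the placeholder into a real missing value so it is counted; treating `'na'` as missing is an assumption about what the placeholder means:

```python
import pandas as pd

# made-up rows that use the same 'na' placeholder as the dataset
sample_df = pd.DataFrame({
    "job_simp": ["data scientist", "na", "manager", "na"],
    "seniority": ["na", "senior", "na", "na"],
})

# the string 'na' is an ordinary value, so isna() does not see it
placeholder_counts = sample_df.isna().sum().to_dict()

# mask() replaces every cell equal to 'na' with NaN
cleaned_df = sample_df.mask(sample_df == "na")
missing_counts = cleaned_df.isna().sum().to_dict()

# rows where both fields are missing
both_missing = int(cleaned_df[["job_simp", "seniority"]].isna().all(axis=1).sum())

print(placeholder_counts, missing_counts, both_missing)
```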

In [65]:
# count how often each avg_salary value occurs within each job title
# (these are frequencies, not the mean salary per title)

# number of unique job titles, for reference
# len(salary_df["Job Title"].unique())
# len(salary_df['job_simp'].unique())
salary_df["avg_salary"].groupby(salary_df['Job Title']).value_counts()

Job Title                                           avg_salary
Ag Data Scientist                                   80.5          1
Analytics - Business Assurance Data Analyst         43.0          2
Analytics Consultant                                66.5          1
Analytics Manager                                   87.5          2
Analytics Manager - Data Mart                       64.0          4
                                                                 ..
System and Data Analyst                             59.0          1
Systems Engineer II - Data Analyst                  62.5          2
Technology-Minded, Data Professional Opportunities  70.5          2
VP, Data Science                                    124.5         2
Web Data Analyst                                    106.0         1
Name: count, Length: 432, dtype: int64
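
To get the average salary per group rather than the frequency of each salary value, the grouped column can be aggregated with `mean`. A minimal sketch on made-up rows; grouping on `job_simp` instead of `Job Title` is an assumption made here to keep the number of groups small:

```python
import pandas as pd

# made-up rows standing in for salary_df
sample_df = pd.DataFrame({
    "job_simp": ["data scientist", "data scientist", "analyst", "manager"],
    "avg_salary": [72.0, 88.0, 60.0, 127.5],
})

# mean salary and number of postings per simplified job title
salary_by_job = (
    sample_df.groupby("job_simp")["avg_salary"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(salary_by_job)
```

Keeping the `count` next to the `mean` shows how many postings each average rests on.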

In [66]:
# display the Type of Ownership
values = salary_df['Type of ownership'].value_counts(
    ascending=False).values

keys = salary_df['Type of ownership'].value_counts(
    ascending=False).keys()

bar_chart = px.bar(x=keys, y=values, color=values, text=values)

bar_chart.update_layout(
    xaxis_title="Type of Ownership",
    yaxis_title="Count",
    title="Type of Ownership Distribution",
)

bar_chart.show()

In [67]:
# display the Sector
values = salary_df['Sector'].value_counts(
    ascending=False).values

keys = salary_df['Sector'].value_counts(
    ascending=False).keys()

bar_chart = px.bar(x=keys, y=values, color=values, text=values)

bar_chart.update_layout(
    xaxis_title="Sector",
    yaxis_title="Count",
    title="Sector Distribution",
)

bar_chart.show()

In [68]:
# display the Industry distribution
values = salary_df['Industry'].value_counts(
    ascending=False).values

keys = salary_df['Industry'].value_counts(
    ascending=False).keys()

bar_chart = px.bar(x=keys, y=values, color=values, text=values)

bar_chart.update_layout(
    xaxis_title="Industry",
    yaxis_title="Count",
    title="Industry Distribution",
)

bar_chart.show()

In [69]:
# display the Size
values = salary_df['Size'].value_counts(
    ascending=False).values

keys = salary_df['Size'].value_counts(
    ascending=False).keys()

bar_chart = px.bar(x=keys, y=values, color=values, text=values)

bar_chart.update_layout(
    xaxis_title="Size",
    yaxis_title="Count",
    title="Size Distribution",
)

bar_chart.show()

In [70]:
# Calculate average salary for each seniority group
average_salaries = salary_df.groupby('seniority')['avg_salary'].mean().reset_index()

# Create a bar chart
bar_chart = px.bar(
    average_salaries,
    x='seniority', 
    y='avg_salary',
    text='avg_salary',
    color='avg_salary',
    title="Average Salary by Seniority"
)

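# Customize layout
# (a continuous color scale is drawn as a colorbar, so its title is set on the color axis rather than the legend)
bar_chart.update_layout(
    xaxis_title="Seniority",
    yaxis_title="Average Salary",
    coloraxis_colorbar=dict(title="Average Salary")
)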

bar_chart.show()


#### Data Preprocessing

##### Column Headers Manipulation

In [71]:
# put all columns into lowercase, replace spaces with underscores
salary_df.columns = salary_df.columns.str.lower().str.replace(" ", "_")

# drop the unnamed column
salary_df = salary_df.drop(columns=['unnamed:_0'])

In [72]:
# ensure the boolean columns are present in the DataFrame
boolean_columns = pd.Index(boolean_columns).str.lower()
boolean_columns = boolean_columns.intersection(salary_df.columns)

# change the data types for the boolean columns
salary_df[boolean_columns] = salary_df[boolean_columns].astype(bool)

salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 742 entries, 0 to 741
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   job_title          742 non-null    object 
 1   location           742 non-null    object 
 2   size               742 non-null    object 
 3   type_of_ownership  742 non-null    object 
 4   industry           742 non-null    object 
 5   sector             742 non-null    object 
 6   hourly             742 non-null    bool   
 7   employer_provided  742 non-null    bool   
 8   min_salary         742 non-null    int64  
 9   max_salary         742 non-null    int64  
 10  avg_salary         742 non-null    float64
 11  job_state          742 non-null    object 
 12  python_yn          742 non-null    bool   
 13  r_yn               742 non-null    bool   
 14  spark              742 non-null    bool   
 15  aws                742 non-null    bool   
 16  excel              742 non

##### Column Transformation

In [75]:
# set the categorical columns
categorical_columns = salary_df.select_dtypes(include='object').columns

In [None]:
# Preprocess features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
        # ('num', StandardScaler(), numerical_columns) # I don't think there is a need to scale, I might be wrong though (I will compare performances)
        ('bool', 'passthrough', boolean_columns)
    ]
)

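# a Pipeline accepts only one estimator, as its final step, so each model gets its own pipeline
lr_pipeline = Pipeline([
        ('preprocessing', preprocessor),
        # NOTE: LogisticRegression is a classifier; it cannot be fitted on the continuous avg_salary target unless the target is first binned into classes
        ('lr_model', LogisticRegression(random_state=42, max_iter=2000))
    ])

rf_pipeline = Pipeline([
        ('preprocessing', preprocessor),
        ('rf_model', RandomForestRegressor(n_estimators=100))
    ])


rf_pipeline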
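
A pipeline takes a single estimator as its final step, and it still needs features and a target to be fitted. A minimal end-to-end sketch on made-up rows, assuming `avg_salary` is the target and that `min_salary` and `max_salary` are left out of the features (in the rows shown earlier, `avg_salary` is their midpoint); the column selection and the choice of mean absolute error are illustrative, not the notebook's final setup:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# made-up rows standing in for the preprocessed salary_df
sample_df = pd.DataFrame({
    "sector": ["Health Care", "Education", "Business Services", "Health Care"] * 5,
    "seniority": ["na", "senior", "na", "senior"] * 5,
    "python_yn": [True, False, True, True] * 5,
    "avg_salary": [72.0, 84.5, 85.0, 114.5] * 5,
})

# separate the features from the target
X = sample_df.drop(columns=["avg_salary"])
y = sample_df["avg_salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector", "seniority"]),
        ("bool", "passthrough", ["python_yn"]),
    ]
)

# exactly one estimator, placed as the final step
rf_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("rf_model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

rf_pipeline.fit(X_train, y_train)
predictions = rf_pipeline.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"Mean absolute error: {mae:.2f}")
```

Fitting the preprocessing inside the pipeline means the encoder only learns categories from the training split, and `handle_unknown='ignore'` covers categories that first appear in the test split.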