# ***Predicting Salaries: A Linear Regression Analysis on Workplace Variables***

## ***Hasri Akbar Awal Rozaq***

# **Abstract**

This study focuses on the application of linear regression analysis to examine the correlation between years of experience and salary in a dataset comprising diverse occupational backgrounds. By isolating these two critical variables, we aim to construct a predictive model that sheds light on the quantitative relationship between professional experience and compensation. The investigation seeks to discern the impact of years of experience on salary outcomes, providing valuable insights for employers and HR professionals. The results contribute to a refined understanding of the salary dynamics associated with varying levels of professional expertise, facilitating informed decision-making in the realm of workforce compensation.

# **Introduction**
This project builds a regression system to predict salaries. It was created to complete a Data Science project. To navigate the page, use the ***Table of Contents*** menu at the top left of the page.

# **A.I. Project Cycle**

I completed this project using the A.I. Project Cycle framework, which consists of the following stages:
1. Problem Scoping
2. Data Acquisition
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment

Here is a picture of the related framework.
![ai project cycle](https://github.com/akbarrozaq691/image-segmentation-leaf/assets/41296422/e98ba5e5-9c95-4a23-bd34-0dc7373da190)

## **Problem Scoping**

Problem scoping is the stage where we identify and formulate the problem to be solved with A.I. The problem can be framed by answering the 4 Ws (What, Who, Where, Why), described below:

1. *What*: Salary prediction based on years of experience is a dynamic and essential aspect of workforce management.
2. *Who*: Professionals involved in human resources, data analysts, and individuals interested in salary trends and compensation strategies.
3. *Where*: The focus areas encompass salary prediction models, human resources analytics, and workforce planning, all emphasizing the influence of years of experience on earnings.
4. *Why*: Understanding the correlation between years of experience and salary is crucial for informed decision-making in workforce management.

From the provided details, it can be deduced that the central issue addressed is the prediction of salaries based on an individual's years of experience. This is a vital consideration for professionals in human resources, data analytics, and those interested in compensation strategies, as it contributes to fair and effective workforce management, talent retention, and overall human resources planning.

## **Data Acquisition**

Data acquisition is the stage where data relevant to the problem is collected. The most important step at this stage is selecting the variables related to the problem. Based on the problem scoping, two variables are needed:

1. Years Experience: Years of working experience
2. Salary

Thus, we need to collect the data for further analysis.

The dataset used is taken from the website [Kaggle](https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression/data) with some information as follows:

![Salary](https://github.com/akbarrozaq691/MachineLearningCourse/assets/41296422/1dae2b96-e186-4435-9851-ac74dd547456)

Dataset Information:

Dataset: [Salary Dataset](https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression/data)

|Type                    |Description                                                                                                |
| ----------------------- |  -----------------------------------------------------------------------------------------------------   |
|Source                   |[Kaggle Dataset: Salary Dataset](https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression/data)|
|License                  |CC0: Public Domain                                                                                        |
|Category                 |Computer Science, Education, Linear Regression                                                                  |
|File Type and Size  |ZIP (457B)                                                                                               |

#### **Import the required *library*/modules**

In [1]:
# For deployment
import pickle

# For data processing
import numpy as np
import pandas as pd

# For visualization
import plotly.express as px

# For model evaluation
from sklearn.metrics import r2_score

# For building the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#### **Preparing the Dataset**

In [2]:
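# Get the dataset: convert the Google Drive share link into a direct-download URL
link = 'https://drive.google.com/file/d/1iDjalQyCMCjkhIKcakLPvy1MqZ8pSrLZ/view?usp=sharing'
link = 'https://drive.google.com/uc?id=' + link.split('/')[-2]
df = pd.read_csv(link)
# Drop the first column (the unnamed row index stored in this CSV)
del df[df.columns[0]]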

In [3]:
# Show the first 5 rows of the dataset
df.head()

|     |YearsExperience |Salary   |
| --- | -------------- | ------- |
|0    |1.2             |39344.0  |
|1    |1.4             |46206.0  |
|2    |1.6             |37732.0  |
|3    |2.1             |43526.0  |
|4    |2.3             |39892.0  |


## **Data Exploration**

Data exploration is a stage for digging in-depth information from a dataset, as well as cleaning the data if it has empty parts, outliers, and so on. Then, the cleaned data will be visualized.

### **Information of Data**

In [4]:
# Viewing dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   YearsExperience  30 non-null     float64
 1   Salary           30 non-null     float64
dtypes: float64(2)
memory usage: 608.0 bytes


It can be seen that the data used has **30** rows.

In [5]:
# Description of data
df.describe().T

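|                 |count |mean      |std           |min      |25%       |50%      |75%        |max       |
| --------------- | ---- | -------- | ------------ | ------- | -------- | ------- | --------- | -------- |
|YearsExperience  |30.0  |5.413333  |2.837888      |1.2      |3.3       |4.8      |7.8        |10.6      |
|Salary           |30.0  |76004.0   |27414.429785  |37732.0  |56721.75  |65238.0  |100545.75  |122392.0  |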


The outputs above show that both variables have the *float* data type. The summary also gives the **minimum**, **mean**, and **maximum** value of each variable.

In [6]:
# Checking empty data
df.isnull().sum()

YearsExperience    0
Salary             0
dtype: int64

The output above confirms that the data is **complete**: neither column has missing values.

### **Data Visualization**

We will visualize the data to check that the relationship between the two variables is roughly linear and that there are no outliers, since both affect how well a linear model can fit the data.

In [7]:
# Show figure
fig=px.scatter(df, x="YearsExperience", y="Salary")
fig.show()
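
The absence of outliers can also be checked numerically. The sketch below applies the common 1.5 × IQR rule to the quartiles printed by `df.describe()` above; the helper name `iqr_bounds` and the choice of this rule are assumptions made for illustration, not part of the original analysis.

```python
# Quartiles and extremes copied from the df.describe() output above.
summary = {
    "YearsExperience": {"min": 1.2, "q1": 3.3, "q3": 7.8, "max": 10.6},
    "Salary": {"min": 37732.0, "q1": 56721.75, "q3": 100545.75, "max": 122392.0},
}

def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

for name, s in summary.items():
    lower, upper = iqr_bounds(s["q1"], s["q3"])
    # A column has outliers under this rule if its min or max falls outside the fences.
    has_outliers = s["min"] < lower or s["max"] > upper
    print(f"{name}: fences=({lower:.2f}, {upper:.2f}), outliers={has_outliers}")
```

Under this rule, both the minimum and the maximum of each column fall inside the fences, which agrees with what the scatter plot suggests.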

Linear regression cannot be separated from the idea of a linear function. A linear function forms a straight-line graph between variable X and variable Y, and the relationship can be either positive or negative. In basic math, the equation of a straight line is $y = mX + C$, where $m$ is the gradient (slope), $X$ is the input value, and $C$ is the constant (intercept). Simple linear regression uses the same straight-line form. Here is the linear regression formula.

![equation](https://nosimpler.me/wp-content/uploads/2016/11/lr-formula-300x225.png)

So, the slope m is represented by b, and the constant C is represented by a. The scatter plot above shows that salary rises roughly linearly with work experience (years): the longer someone has worked, the higher the pay.
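
To make the roles of a and b concrete, the sketch below computes them with the closed-form least-squares formulas on the five rows shown by `df.head()` earlier. It is only an illustration on a small sample, not the model trained later in this project. The variable names are chosen for this example, and `np.polyfit` is used purely as a cross-check.

```python
import numpy as np

# First five rows of the dataset, copied from the df.head() output.
x = np.array([1.2, 1.4, 1.6, 2.1, 2.3])                      # YearsExperience
y = np.array([39344.0, 46206.0, 37732.0, 43526.0, 39892.0])  # Salary

# Closed-form least squares for y = a + b * x:
#   b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ** 2)
#   a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# np.polyfit with degree 1 returns [slope, intercept] for the same fit.
slope_check, intercept_check = np.polyfit(x, y, 1)

print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}")
print(np.allclose([b, a], [slope_check, intercept_check]))
```

With only five rows, the resulting slope is not representative of the full dataset; the point is only to show where a and b come from.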

## **Modelling**

#### **Splitting Data**

We need to split the data into training and test sets, each with a different role. The training data is used to fit the model so it can learn the pattern in the data, while the test data is held back and used only for prediction, like students sitting an exam with questions they have not seen before.

There is no fixed rule for the split ratio; finding one that works well takes *trial and error*. Common splits are (80%, 20%), (90%, 10%), and (70%, 30%), where the larger share is used for training and the rest becomes test data.

I used the 90%/10% split in this experiment because the dataset has only 30 rows.

In [8]:
# Prepare data
X = df[['YearsExperience']] # input data
y = df['Salary'].values # output data

In [9]:
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)

In [10]:
print(f'Length of Train Data: {len(X_train)}')
print(f'Length of Test Data: {len(X_test)}')

Length of Train Data: 27
Length of Test Data: 3
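
Because the split is random, each run can select different test rows, and the fitted coefficients and scores change with it. `train_test_split` accepts a `random_state` argument to fix the shuffle. The sketch below illustrates the idea of a seeded 90/10 split using NumPy only; `shuffle_split` is a hypothetical helper written for this example and is not scikit-learn's implementation.

```python
import numpy as np

def shuffle_split(n_rows, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx) from a seeded shuffle of row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)          # shuffled row positions 0..n_rows-1
    n_test = max(1, round(n_rows * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split(30)
print(len(train_idx), len(test_idx))

# The same seed always yields the same test rows.
_, test_idx_again = shuffle_split(30)
print((test_idx == test_idx_again).all())
```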


#### **Make a Model**

I use the `LinearRegression` class from the scikit-learn library.

In [11]:
model = LinearRegression()
model.fit(X_train, y_train)

In [12]:
# Get the coefficient (slope) value
model.coef_

array([9549.2879104])

In [13]:
# Get intercept value
model.intercept_

24890.8139770384

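As explained earlier, the linear regression formula is y = a + bx, where a is the intercept and b is the coefficient. The fit printed above gives a ≈ 24890.81 and b ≈ 9549.29. The function below hard-codes y = 24859.83 + 9437.68 * x instead; these values differ slightly from the printed ones, most likely because the unseeded `train_test_split` produced a different split in the run they were taken from.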

In [14]:
def predict(x):
  rumus = 24859.83 + 9437.68 * x
  return rumus

The above function is later used to predict the test data.
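
Hard-coded numbers go stale whenever the model is refit. One way to avoid this, sketched below, is to pass the intercept and slope in as arguments, so that values such as `model.intercept_` and `model.coef_[0]` can be supplied directly. The name `predict_line` is hypothetical and used only for this example.

```python
def predict_line(x, intercept, slope):
    """Evaluate y = intercept + slope * x for a single input value."""
    return intercept + slope * x

# Using the coefficients hard-coded in the function above and 2.3 years
# of experience (row 4 of the dataset).
salary = predict_line(2.3, intercept=24859.83, slope=9437.68)
print(round(salary, 3))  # about 46566.494, the value listed for row 4 in the prediction output
```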

## **Evaluation**

Evaluation measures how well the model has been trained. What to measure depends on the type of task:
1. Classification -> determine accuracy, precision, and recall
2. Regression -> determine the error value
3. Clustering -> determine the centroid value

The case solved here is a regression problem, so we look at the error value. The higher the error, the worse the model; the lower the error, the better the model.

There are several ways to evaluate, namely:
1. MSE (Mean Squared Error)
2. MAE (Mean Absolute Error)
3. MAPE (Mean Absolute Percentage Error)
4. RMSE (Root Mean Squared Error)
5. R2 Score

As a rule of thumb in this project, the model is considered good if the error value (MAPE) is <= 20% or the R2 Score is > 70%.

Here is the formula for MAPE

<img src="https://miro.medium.com/v2/resize:fit:1400/1*b1nwcFArdmz9WZv5hzMoSw.png" alt="Your image title" width="320"/>

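For each row, the predicted y is subtracted from the actual y, and the absolute difference is divided by the actual y. These ratios are then averaged over the number of rows and multiplied by 100 to give a percentage.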

Here is the formula for R2 Score

<img src="https://miro.medium.com/v2/resize:fit:556/1*JKxx5h_ThVfwqQuMhgBLqQ.png" alt="Your image title" width="320"/>

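Slightly different from MAPE, R2 compares the squared differences between actual and predicted y with the squared differences between each actual y and the mean of y. The R2 value is at most 1. It usually lies between 0 and 1, but it can be negative when the model fits worse than always predicting the mean.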
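
Both formulas can be worked through by hand on a few values. The sketch below uses plain Python and made-up numbers (not taken from the salary dataset); the function names `mape` and `r2` are chosen for this example.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios) * 100

def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up numbers for illustration only.
actual = [100.0, 200.0, 400.0]
close = [110.0, 180.0, 400.0]    # row errors of 10%, 10%, and 0%
flipped = [400.0, 200.0, 100.0]  # worse than always predicting the mean

print(round(mape(actual, close), 2))
print(round(r2(actual, close), 4))
print(r2(actual, flipped) < 0)
```

The last line shows why R2 is not strictly bounded below by 0: a model whose squared errors exceed those of the mean predictor gets a negative score.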


In [15]:
prediction = predict(X_test)

In [16]:
prediction

|     |YearsExperience |
| --- | -------------- |
|25   |110742.718      |
|4    |46566.494       |
|29   |124899.238      |


In [17]:
# Function to calculate the Mean Absolute Percentage Error (MAPE)
def calculate_mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [18]:
mape_lr = calculate_mape(y_test, np.array(prediction))
r2_lr = r2_score(y_test, prediction)
print(f'R2 Score from model: {round(r2_lr * 100, 2)}%')
print(f'MAPE Score from model: {round(mape_lr, 2)}%')

R2 Score from model: 97.87%
MAPE Score from model: 62.21%


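The evaluation reports an R2 Score of 97.87% and a MAPE of 62.21%. The R2 Score is well above the 70% threshold, so the fitted line follows the test data closely. The MAPE, however, is not meaningful as computed: `prediction` is a one-column DataFrame of shape (3, 1) while `y_test` has shape (3,), so `y_true - y_pred` broadcasts to a 3x3 matrix and the mean is taken over mismatched pairs. Flattening the predictions first, for example with `np.array(prediction).ravel()`, gives the row-wise MAPE. With only three test rows, both metrics are also sensitive to the random split, so trying other splits or another model is a reasonable next step.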
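
One thing worth checking in a MAPE calculation like the one above is array shape. When the predictions have shape (n, 1), as a one-column DataFrame does, and the targets have shape (n,), NumPy broadcasts the subtraction to an (n, n) matrix instead of comparing row by row. The sketch below shows the effect on made-up numbers and a shape-safe variant; `mape_aligned` is a hypothetical name used only for this example.

```python
import numpy as np

def mape_aligned(y_true, y_pred):
    """MAPE in percent; inputs are flattened so rows are compared pairwise."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must contain the same number of values")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Made-up numbers for illustration only.
y_true = np.array([100.0, 200.0, 400.0])        # shape (3,)
y_pred = np.array([[110.0], [180.0], [400.0]])  # shape (3, 1), like a one-column DataFrame

# Without flattening, the subtraction broadcasts to a 3x3 matrix.
print((y_true - y_pred).shape)
naive = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"naive: {naive:.2f}%, aligned: {mape_aligned(y_true, y_pred):.2f}%")
```

The naive value averages nine mismatched pairs, so it can be far larger than the row-wise error.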

## **Deployment**

We can export the trained model so that it can later be used in an application without retraining it each time. Here I use pickle.

In [19]:
filename = 'model.pkl'
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)
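
Loading the file back follows the same pattern in reverse. The sketch below round-trips a plain dictionary of coefficients through a temporary file as a stand-in for the trained model; the dictionary and the temporary path are assumptions made for illustration, and the same `pickle.load` call would apply to `model.pkl`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model: the coefficients hard-coded earlier.
params = {"intercept": 24859.83, "slope": 9437.68}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")

    # Save the object to disk.
    with open(path, "wb") as f:
        pickle.dump(params, f)

    # Load it back, as an application would do at startup.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == params)
```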