# Employee Salary Prediction Using Multi-Linear Regression

You are working as a data scientist in a company that wants to predict the salaries of employees based on various factors. Your task is to implement two key functions that will help in building a predictive model using multi-linear regression and feature selection.

Given a dataset with the following features for each employee:

* Years of Experience
* Education Level (1 for High School, 2 for Bachelor’s, 3 for Master’s, 4 for PhD)
* Age
* Industry (1 for Tech, 2 for Finance, 3 for Healthcare)
* City (1 for City A, 2 for City B, 3 for City C)
* Salary (Target Variable)

You are required to create two functions:

* Feature Selection and Regression Model:

  * Perform feature selection by selecting the top 3 features that are most strongly correlated with the salary.
  * Train a multi-linear regression model using the selected features.
* Salary Prediction:

  * Predict the salary for a new employee using the trained model and the selected features.
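
The correlation-based selection described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's dataset: the column names `a`, `b`, `c`, `d` and the helper `top_k_by_correlation` are hypothetical, and Pearson correlation is assumed because it is the pandas default.

```python
import numpy as np
import pandas as pd


def top_k_by_correlation(X: pd.DataFrame, y: pd.Series, k: int = 3) -> list:
    """Return the k column names of X with the largest |Pearson r| against y."""
    correlations = X.corrwith(y).abs()
    return correlations.nlargest(k).index.tolist()


# Synthetic data (assumption): y depends on a, b and c, while d is pure noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = (
    3.0 * X_demo["a"]
    - 2.0 * X_demo["b"]
    + 1.0 * X_demo["c"]
    + rng.normal(scale=0.1, size=500)
)

top_features = top_k_by_correlation(X_demo, y_demo, k=3)
print(top_features)
```

Taking the absolute value matters here: `b` has a negative coefficient, so its raw correlation with the target is negative even though it is informative.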

Functions to Implement:

* feature_selection_and_regression(X, Y):

  * Takes in:
    * X: A 2D array (or DataFrame) of shape (n, 5) representing n employees and their features (Years of Experience, Education Level, Age, Industry, City).
    * Y: A 1D array (or Series) representing the salary for each employee.
  * Objective: Perform feature selection to choose the top 3 most relevant features. Train a multi-linear regression model using these selected features.
  * Returns: The trained model and a list of the selected feature names.
* predict_salary(model, new_employee, selected_features):

  * Takes in:
    * model: The trained regression model from the previous function.
    * new_employee: A list containing the features for a new employee in the format [Years of Experience, Education Level, Age, Industry, City].
    * selected_features: A list of the selected feature names from the previous function.
  * Objective: Predict the salary for the new employee using the trained model and selected features.
  * Returns: The predicted salary as a float.
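
For reference, the two quantities behind these functions can be written out. The symbols below ($r_j$, $\beta_k$, $\hat{y}$, $x_{(k)}$) are notation introduced here rather than taken from the task statement, and Pearson correlation is assumed as the correlation measure. Each feature $x_j$ is scored by its correlation with the salary $y$ over the $n$ employees:

$$
r_j = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The three features with the largest $|r_j|$, written $x_{(1)}, x_{(2)}, x_{(3)}$, then enter a linear model whose coefficients are fitted by ordinary least squares:

$$
\hat{y} = \beta_0 + \beta_1 x_{(1)} + \beta_2 x_{(2)} + \beta_3 x_{(3)}
$$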


In [1]:
print("Hello, Begin Your Data Journey")


Hello, Begin Your Data Journey


In [2]:
!pip3 install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [3]:
# import libraries
import pandas as pd
import numpy as np 
from sklearn.linear_model import LinearRegression


In [4]:
# Data Loading
import os
jupyter_notebook_dataset = os.getenv("dataset_url", "https://d3dyfaf3iutrxo.cloudfront.net/general/upload/87856638fddd4092bec161f1eebedf17.csv")
data = pd.read_csv(jupyter_notebook_dataset)


In [5]:
# The first 5 rows of the data
data.head(5)


   Years of Experience  Education Level  Age  Industry  City  Salary
0                    7                3   31         3     2   30691
1                   20                3   51         1     3   67471
2                   29                2   46         1     3   80464
3                   15                4   41         1     3   72188
4                   11                4   26         3     2   39994


In [6]:
# Count the missing values in each column
data.isnull().sum()


Years of Experience    0
Education Level        0
Age                    0
Industry               0
City                   0
Salary                 0
dtype: int64
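
The null count above covers missing values; column types can be inspected in the same way with `dtypes`. A minimal sketch, using only the five rows shown by `data.head(5)` so that it runs without downloading the CSV (the name `sample` is hypothetical, and five rows say nothing about the rest of the file):

```python
import pandas as pd

# The five rows displayed by data.head(5), retyped by hand.
sample = pd.DataFrame(
    {
        "Years of Experience": [7, 20, 29, 15, 11],
        "Education Level": [3, 3, 2, 4, 4],
        "Age": [31, 51, 46, 41, 26],
        "Industry": [3, 1, 1, 1, 3],
        "City": [2, 3, 3, 3, 2],
        "Salary": [30691, 67471, 80464, 72188, 39994],
    }
)

null_counts = sample.isnull().sum()  # missing values per column
column_types = sample.dtypes         # dtype of each column

print(null_counts)
print(column_types)
```

On the full dataset, the equivalent calls would be `data.isnull().sum()` and `data.dtypes`.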

In [7]:
# Feature selection and model fitting
def feature_selection_and_regression(X, Y):
    # Convert X to a DataFrame with named columns for easier handling
    feature_names = ['Years of Experience', 'Education Level', 'Age', 'Industry', 'City']
    X = pd.DataFrame(X, columns=feature_names)

    # Convert Y to a Series that shares X's index so rows pair up correctly
    y = pd.Series(np.asarray(Y).ravel(), index=X.index, name='Salary')

    # Absolute correlation between each feature and the salary, computed
    # from the function arguments rather than the global `data`
    correlations = X.corrwith(y).abs()

    # Select the top 3 most correlated features
    selected_features = correlations.nlargest(3).index.tolist()

    # Keep only the selected features in X
    X_selected = X[selected_features]

    # Initialize and train the regression model
    model = LinearRegression()
    model.fit(X_selected, y)

    return model, selected_features




In [8]:
def predict_salary(model, new_employee, selected_features):
    # Map the positional feature values to their column names
    columns = ['Years of Experience', 'Education Level', 'Age', 'Industry', 'City']
    new_employee_dict = dict(zip(columns, new_employee))

    # Keep only the features the model was trained on, in the same order
    selected_values = [new_employee_dict[feature] for feature in selected_features]

    # Predict from a one-row DataFrame so the column names match those seen during fit
    row = pd.DataFrame([selected_values], columns=selected_features)
    predicted_salary = model.predict(row)

    # Return a plain Python float
    return float(predicted_salary[0])
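
The lookup step in `predict_salary` can be tried on its own. The sketch below uses the feature order from the problem statement and the example values that appear later in this notebook. Wrapping the selected values in a one-row DataFrame keeps the column names attached, which scikit-learn can check against the names seen during fitting. The names `all_columns`, `employee_by_name`, and `row` are hypothetical.

```python
import pandas as pd

# Feature order from the problem statement.
all_columns = ["Years of Experience", "Education Level", "Age", "Industry", "City"]

# Example inputs, copied from the notebook's own run.
new_employee = [6, 3, 30, 2, 2]
selected_features = ["Years of Experience", "Education Level", "Age"]

# Map each feature name to the employee's value, then keep the selected ones.
employee_by_name = dict(zip(all_columns, new_employee))
selected_values = [employee_by_name[name] for name in selected_features]

# One-row DataFrame with named columns, suitable for model.predict(row).
row = pd.DataFrame([selected_values], columns=selected_features)

print(selected_values)
print(row)
```

Looking values up by name rather than by position means the result stays correct if the selected features come back in a different order than the input columns.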


In [9]:
# Prepare the features, the target, and a new employee to score

X = data[['Years of Experience', 'Education Level', 'Age', 'Industry', 'City']]
Y = data['Salary']
new_employee = [6, 3, 30, 2, 2]


In [10]:
# Perform feature selection and regression
model, selected_features = feature_selection_and_regression(X, Y)


In [11]:
my_string = ' '.join(selected_features)


In [12]:
selected_features


['Years of Experience', 'Education Level', 'Age']

In [13]:
import json
# Convert the string to JSON format
my_string_json = json.dumps(my_string)

my_string_json


'"Years of Experience Education Level Age"'

In [14]:
# Predict salary for a new employee and store it in a variable "predicted_salary"
predicted_salary = predict_salary(model, new_employee, selected_features)




In [15]:
round(predicted_salary,2)


39345.56
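
A prediction from `LinearRegression` is the fitted intercept plus the dot product of the coefficients with the feature values. The sketch below checks this on synthetic data (an assumption, since the notebook's fitted coefficients are not shown); `demo_model`, `X_demo`, `y_demo`, and `x_new` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (assumption): y = 1000 + 2000*x0 + 500*x1 - 100*x2
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(50, 3))
y_demo = 1000 + X_demo @ np.array([2000.0, 500.0, -100.0])

demo_model = LinearRegression().fit(X_demo, y_demo)

# Compare model.predict with the linear combination computed by hand.
x_new = np.array([6.0, 3.0, 30.0])
by_predict = demo_model.predict(x_new.reshape(1, -1))[0]
by_hand = demo_model.intercept_ + demo_model.coef_ @ x_new

print(by_predict, by_hand)
```

For the model trained in this notebook, the same check would use `model.intercept_` and `model.coef_`, with the coefficients in the order given by `selected_features`.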