<a href="https://colab.research.google.com/github/YashNigam65/gitfolder/blob/master/notebook/supervised_classification/employee_salary_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is  the classic example of Supervised learning : Classification algorithm  (knn_frameworks) on the employee salary dataset with label encoder.

In [49]:
# KNN Classification

# step 1: import the required modules
from pandas import read_csv
import pandas as pd
# from sklearn.model_selection import KFold     # for dividing test and training data
# from sklearn.model_selection import cross_val_score #cross_val_score is typically used for evaluating model performance using cross-validation
# from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing # for label encoding
from sklearn.neighbors import KNeighborsClassifier  # for KNN algorithm


In [27]:
# We are not using cross_val_score and LogisticRegression in the main code because:
# - The focus here is on demonstrating KNN (KNeighborsClassifier) for classification, not Logistic Regression.
# - cross_val_score is typically used for evaluating model performance using cross-validation, but in this workflow,
# we are training and testing the model directly.

# Use of cross_val_score:
# - It helps estimate the accuracy of a model by splitting the data into multiple folds and training/testing on each fold.
# - Useful for comparing models and checking for overfitting.

# Use of LogisticRegression:
# - It is another classification algorithm, often used as a baseline for comparison.
# - Predicts categorical outcomes and is simple, fast, and interpretable.

# In summary:
# - We use KNN here for its simplicity and effectiveness on the Iris dataset.
# - cross_val_score and LogisticRegression are useful tools, but not required for this specific workflow.

Logistic Regression is a supervised machine learning algorithm used for classification tasks. It predicts the probability that an input belongs to a particular category (e.g., Yes/No, Spam/Not Spam, 0/1). Unlike linear regression, which predicts continuous values, logistic regression outputs values between 0 and 1 using a sigmoid function, making it suitable for binary and multi-class classification problems. It is simple, fast, and provides interpretable results, making it a common baseline model for classification.

cross_val_score
Runs K-Fold (or other CV strategies) behind the scenes and returns accuracy (or another metric) for each fold.

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)        # accuracies for each fold
print(scores.mean()) # average accuracy

LogisticRegression
What it is → A machine learning algorithm for classification.
Use → Predicts categorical outcomes (e.g., Yes/No, Spam/Not Spam, 0/1).
Why → Simple, fast, interpretable baseline algorithm for classification tasks.

In [50]:
# step 2: read the data from the csv file
filename = 'https://raw.githubusercontent.com/YashNigam65/gitfolder/refs/heads/master/dataset/Salary_Data.csv'
df = read_csv(filename)
print(df.head)  # show first 5 row
# display(df[df['Gender'].isna()])



<bound method NDFrame.head of      Age  Gender Education Level                      Job Title  \
0     32    Male      Bachelor's              Software Engineer   
1     28  Female        Master's                   Data Analyst   
2     45    Male             PhD                 Senior Manager   
3     36  Female      Bachelor's                Sales Associate   
4     52    Male        Master's                       Director   
..   ...     ...             ...                            ...   
368   35  Female      Bachelor's       Senior Marketing Analyst   
369   43    Male        Master's         Director of Operations   
370   29  Female      Bachelor's         Junior Project Manager   
371   34    Male      Bachelor's  Senior Operations Coordinator   
372   44  Female             PhD        Senior Business Analyst   

     Years of Experience  Salary  
0                    5.0   90000  
1                    3.0   65000  
2                   15.0  150000  
3                    7.0 

In [51]:
# step 3: do preprocessing - in this case do label encoding for the last column
# label_encoder object knows how to understand word labels.
# fit_transform() calculates the unique values in the 'flower_name' column
# and assigns a unique integer to each unique value
# (e.g., 'Iris-setosa' might become 0, 'Iris-versicolor' might become 1,
# and 'Iris-virginica' might become 2).
label_encoder = preprocessing.LabelEncoder()
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Education Level']= label_encoder.fit_transform(df['Education Level'])
df['Job Title']= label_encoder.fit_transform(df['Job Title'])
display(df)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32,1,0,159,5.0,90000
1,28,0,1,17,3.0,65000
2,45,1,2,130,15.0,150000
3,36,0,0,101,7.0,60000
4,52,1,1,22,20.0,200000
...,...,...,...,...,...,...
368,35,0,0,131,8.0,85000
369,43,1,1,30,19.0,170000
370,29,0,0,70,2.0,40000
371,34,1,0,137,7.0,90000


In [52]:
# step 4: split to input and output
# print(df['Gender'].unique())
#print(df.values)

array = df.values
inputx = array[:,0:5]
#print("\n",inputx,"\n")
outputy = array[:,5]
#print("\n",outputy,"\n")


# or
# you can do same thing as below
# iloc = index location
# It is used in Pandas DataFrames to select rows/columns by index number (not by name).

# inputx = df.iloc[:,0:4].values
# print("\n",inputx,"\n")
# outputy = df.iloc[:,4].values
# print("\n",outputy,"\n")



In [53]:
# step 5: select the model
model = KNeighborsClassifier()
print("\nThe model selected is",model)

# In essence, KNeighborsClassifier is a good starting point for this classification problem due to its simplicity,
# interpretability, and known effectiveness on similar datasets.
# Suitability for Classification: The Iris dataset is a classic classification problem where
# the goal is to predict the species of iris flower based on its measurements.
# KNN is a well-suited algorithm for multi-class classification tasks like this.



The model selected is KNeighborsClassifier()


In [54]:
# step 6: train or build the model
model.fit(inputx,outputy)



In [66]:
# step 7: do testing or model prediction
filename = 'https://raw.githubusercontent.com/YashNigam65/gitfolder/refs/heads/master/dataset/Salary_Data_test.csv'
newdataframe = read_csv(filename)
#print(newdataframe)
label_encoder = preprocessing.LabelEncoder()
newdataframe['Gender']= label_encoder.fit_transform(newdataframe['Gender'])
newdataframe['Education Level']= label_encoder.fit_transform(newdataframe['Education Level'])
newdataframe['Job Title']= label_encoder.fit_transform(newdataframe['Job Title'])
array = newdataframe.values
z = array[:,0:5]
print("\n",array,"\n")
res=model.predict(z)
print(res)
# reslist=[]





 [[26  1  0  7  5]
 [26  0  0  7  5]
 [28  0  1  0  3]
 [28  1  1  0  3]
 [45  1  2  5 15]
 [45  0  2  5 15]
 [36  0  0  4  7]
 [36  1  0  4  7]
 [52  1  1  1 20]
 [52  0  1  1 20]
 [29  1  0  2  2]
 [29  0  0  2  2]
 [42  0  1  3 12]
 [42  1  1  3 12]
 [38  1  2  6 10]
 [38  0  2  6 10]] 

[ 40000.  40000.  75000.  75000.  55000.  55000.  45000.  45000. 250000.
 250000.  75000.  55000.  45000.  45000.  45000.  45000.]


In [67]:
# Create a DataFrame from the input features (z) and add the predicted salaries (res)
result_df = pd.DataFrame(z, columns=['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience'])
result_df['Predicted Salary'] = res


# Display the resulting table
display(result_df)

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Predicted Salary
0,26,1,0,7,5,40000.0
1,26,0,0,7,5,40000.0
2,28,0,1,0,3,75000.0
3,28,1,1,0,3,75000.0
4,45,1,2,5,15,55000.0
5,45,0,2,5,15,55000.0
6,36,0,0,4,7,45000.0
7,36,1,0,4,7,45000.0
8,52,1,1,1,20,250000.0
9,52,0,1,1,20,250000.0
