# K Nearest Neighbour
# Machine Learning Model on Salary Dataset

K Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for classification and regression. It rests on the assumption that points close to each other in feature space tend to share the same class or have similar target values. To predict for a new point, the algorithm finds its k nearest neighbors in the training data and returns their majority class (classification) or their average value (regression).

Here's how it works:
1. **Training Phase**: The algorithm simply stores the feature vectors and their corresponding class labels or values. No model parameters are fitted.
2. **Prediction Phase**:
   - For classification: Given a new data point, the algorithm identifies the k training points nearest to it under some distance metric (commonly Euclidean distance). It then assigns the class label by majority vote among these neighbors.
   - For regression: The algorithm finds the k nearest neighbors in the same way and predicts the average of their target values.
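
The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements KNN; the helper names (`k_nearest_indices`, `knn_classify`, `knn_regress`) and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def k_nearest_indices(X_train, query, k):
    """Return indices of the k training rows closest to `query` (Euclidean)."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    return np.argsort(distances, kind="stable")[:k]


def knn_classify(X_train, y_train, query, k=3):
    """Majority vote among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, query, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3):
    """Mean target value of the k nearest neighbors."""
    idx = k_nearest_indices(X_train, query, k)
    return float(y_train[idx].mean())


# Toy data (made up for illustration): two features per point.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array(["low", "low", "low", "high", "high"])
values = np.array([10.0, 12.0, 14.0, 80.0, 90.0])

query = np.array([1.2, 1.1])
predicted_label = knn_classify(X_toy, labels, query, k=3)
predicted_value = knn_regress(X_toy, values, query, k=3)
print(predicted_label, predicted_value)
```

Note that "training" here is nothing more than keeping `X_toy` and the targets around; all the work happens at prediction time.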

**Where to use KNN**:
- KNN is a versatile algorithm used in various fields such as recommendation systems, pattern recognition, image recognition, and anomaly detection.
- It's particularly useful when the decision boundary is irregular, or when you do not want to assume a particular functional form for the data, since KNN is non-parametric.
- KNN can be applied to both classification and regression tasks.

**Data Requirements**:
- **Labeled Data**: KNN requires labeled training data, meaning each data point must have a known class label or value.
- **Feature Scaling**: Because predictions depend on distances, features with larger numeric ranges dominate the result unless the features are scaled. Normalization or standardization is usually applied first.
- **Distance Metric**: A distance metric needs to be chosen based on the characteristics of the data. Euclidean distance is commonly used, but other metrics such as Manhattan distance or cosine distance can be used depending on the problem.
- **Memory**: As an instance-based algorithm, KNN requires storing all training data in memory. This can be a limitation when dealing with very large datasets, as it can lead to high memory consumption and slow prediction times.
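
To make the scaling and distance-metric points concrete, here is a small NumPy sketch. The three rows and the helper names are invented for illustration; the second column stands in for any feature with a much larger numeric range than the others.

```python
import numpy as np


def euclidean(a, b):
    """Straight-line distance."""
    return float(np.sqrt(((a - b) ** 2).sum()))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return float(np.abs(a - b).sum())


def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two vectors."""
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# The same pair of points under three metrics.
a, b = np.array([3.0, 4.0]), np.array([4.0, 3.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))

# Made-up rows: [age in years, a hypothetical large-magnitude feature].
points = np.array([
    [25.0, 50_000.0],  # row 0: the query
    [60.0, 50_500.0],  # row 1: very different age
    [26.0, 50_800.0],  # row 2: almost the same age
])


def nearest_to_first(rows, metric):
    """Index (1 or 2) of the row closest to row 0 under `metric`."""
    distances = [metric(rows[0], row) for row in rows[1:]]
    return int(np.argmin(distances)) + 1


# Raw values: the second column dominates the distance.
raw_nearest = nearest_to_first(points, euclidean)

# Standardize each column to zero mean and unit variance, then compare again.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
scaled_nearest = nearest_to_first(scaled, euclidean)

print(raw_nearest, scaled_nearest)
```

On the raw values the nearest row is decided almost entirely by the large-magnitude column; after standardization the age difference counts as well, and the nearest row changes.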

### Import

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor

In [29]:
df = pd.read_csv('D:\\Data Practice JN\\Pre-Processing\\Wrangled Data of Salary_Dataset.csv')
df

(index),Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,0,32.0,0,Bachelor's,Software Engineer,5.000000,115296.421831
1,1,28.0,1,Master's,Data Analyst,3.000000,65000.000000
2,2,45.0,0,PhD,Software Engineer,8.090834,150000.000000
3,3,36.0,0,Bachelor's,Sales Associate,7.000000,60000.000000
4,4,52.0,0,Master's,Director,8.000000,200000.000000
...,...,...,...,...,...,...,...
6545,6698,49.0,1,PhD,Director of Marketing,20.000000,200000.000000
6546,6699,32.0,0,High School,Sales Associate,3.000000,50000.000000
6547,6700,30.0,1,Bachelor's,Financial Manager,4.000000,55000.000000
6548,6701,46.0,0,Master's,Marketing Manager,14.000000,140000.000000


In [30]:
df.drop(columns=['Unnamed: 0'], inplace=True)
df.drop(columns=['Job Title'], inplace=True)
df.head()

(index),Age,Gender,Education Level,Years of Experience,Salary
0,32.0,0,Bachelor's,5.0,115296.421831
1,28.0,1,Master's,3.0,65000.0
2,45.0,0,PhD,8.090834,150000.0
3,36.0,0,Bachelor's,7.0,60000.0
4,52.0,0,Master's,8.0,200000.0


#### Label Encoding

In [31]:
# Map the 'Gender' column to integer codes: Male -> 0, Female -> 1, Other -> 2.
# If the column already holds these codes (as the preview above shows), replace() leaves it unchanged.
df['Gender'] = df['Gender'].replace({'Male': 0, 'Female': 1, 'Other': 2})

# Print the DataFrame to check
print(df)

       Age Gender Education Level  Years of Experience         Salary
0     32.0      0      Bachelor's             5.000000  115296.421831
1     28.0      1        Master's             3.000000   65000.000000
2     45.0      0             PhD             8.090834  150000.000000
3     36.0      0      Bachelor's             7.000000   60000.000000
4     52.0      0        Master's             8.000000  200000.000000
...    ...    ...             ...                  ...            ...
6545  49.0      1             PhD            20.000000  200000.000000
6546  32.0      0     High School             3.000000   50000.000000
6547  30.0      1      Bachelor's             4.000000   55000.000000
6548  46.0      0        Master's            14.000000  140000.000000
6549  26.0      1     High School             1.000000   35000.000000

[6550 rows x 5 columns]


In [32]:
# Replace education levels with integer codes in the 'Education Level' column of 'df'.
# Note: these codes are not in rank order (High School is 3, above PhD at 2),
# so distances computed on this column do not follow the ordering of the levels.
df['Education Level'] = df['Education Level'].replace({"Bachelor's": 0, "Master's": 1, "PhD": 2, 'High School': 3})

# Print the updated DataFrame to check
print(df)

       Age Gender  Education Level  Years of Experience         Salary
0     32.0      0                0             5.000000  115296.421831
1     28.0      1                1             3.000000   65000.000000
2     45.0      0                2             8.090834  150000.000000
3     36.0      0                0             7.000000   60000.000000
4     52.0      0                1             8.000000  200000.000000
...    ...    ...              ...                  ...            ...
6545  49.0      1                2            20.000000  200000.000000
6546  32.0      0                3             3.000000   50000.000000
6547  30.0      1                0             4.000000   55000.000000
6548  46.0      0                1            14.000000  140000.000000
6549  26.0      1                3             1.000000   35000.000000

[6550 rows x 5 columns]
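
Integer codes feed directly into KNN's distance computation, so the order of the codes matters. Below is a minimal sketch of an ordered encoding, assuming the levels are meant to be ranked High School < Bachelor's < Master's < PhD. The ranking, the `toy` frame, and the `education_order` mapping are illustrative assumptions, not the mapping used in this notebook.

```python
import pandas as pd

# Small made-up frame that reuses the notebook's column name.
toy = pd.DataFrame(
    {"Education Level": ["Bachelor's", "Master's", "PhD", "High School"]}
)

# Hypothetical rank-ordered mapping (an assumption for this sketch).
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}

# map() returns NaN for any value missing from the mapping, which makes
# unexpected categories easy to spot.
toy["Education Level (ordinal)"] = toy["Education Level"].map(education_order)
print(toy)
```

With this ordering, adjacent levels differ by 1, so a Bachelor's row is closer to a High School row than to a PhD row along this feature.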




## Model Building

### Define Features and Label

In [33]:
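# Features: every column except the last. Label: the last column (Salary).
x = df.iloc[:, :-1]
y = df.iloc[:, -1:]  # the slice keeps y as a one-column DataFrame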

In [34]:
df.head()

(index),Age,Gender,Education Level,Years of Experience,Salary
0,32.0,0,0,5.0,115296.421831
1,28.0,1,1,3.0,65000.0
2,45.0,0,2,8.090834,150000.0
3,36.0,0,0,7.0,60000.0
4,52.0,0,1,8.0,200000.0


In [35]:
x

(index),Age,Gender,Education Level,Years of Experience
0,32.0,0,0,5.000000
1,28.0,1,1,3.000000
2,45.0,0,2,8.090834
3,36.0,0,0,7.000000
4,52.0,0,1,8.000000
...,...,...,...,...
6545,49.0,1,2,20.000000
6546,32.0,0,3,3.000000
6547,30.0,1,0,4.000000
6548,46.0,0,1,14.000000


In [36]:
y

(index),Salary
0,115296.421831
1,65000.000000
2,150000.000000
3,60000.000000
4,200000.000000
...,...
6545,200000.000000
6546,50000.000000
6547,55000.000000
6548,140000.000000


### Train Test Split

In [37]:
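from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; random_state fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)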

In [38]:
x_test

(index),Age,Gender,Education Level,Years of Experience
4660,36.0,0,2,11.0
1363,48.0,0,2,16.0
4032,29.0,1,3,1.0
3779,25.0,1,0,1.0
6269,26.0,0,3,2.0
...,...,...,...,...
562,24.0,0,0,1.0
95,39.0,1,0,12.0
3554,34.0,1,0,6.0
1050,28.0,1,0,5.0


In [39]:
x_train

(index),Age,Gender,Education Level,Years of Experience
3679,33.0,1,0,6.0
4908,26.0,0,0,5.0
3559,25.0,1,0,1.0
5723,39.0,1,0,16.0
2111,27.0,0,0,3.0
...,...,...,...,...
4931,25.0,1,0,3.0
3264,34.0,0,0,3.0
1653,43.0,0,2,13.0
2607,24.0,1,0,2.0


In [40]:
y_test

(index),Salary
4660,135000.0
1363,190000.0
4032,26000.0
3779,35000.0
6269,40000.0
...,...
562,90000.0
95,65000.0
3554,75000.0
1050,150000.0


In [41]:
y_train

(index),Salary
3679,75000.0
4908,85000.0
3559,35000.0
5723,200000.0
2111,80000.0
...,...
4931,65000.0
3264,50000.0
1653,185000.0
2607,55000.0


### Model Fitting

In [43]:
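# Default settings: n_neighbors=5, uniform weights, Euclidean distance (Minkowski with p=2).
# The features are passed in unscaled here.
model = KNeighborsRegressor()
model.fit(x_train, y_train)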
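
The model above uses the default `n_neighbors=5` on unscaled features. A common refinement is to standardize the features inside a pipeline and pick k by cross-validation. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo`, and the candidate values of k are made up for illustration and are not the salary dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=200)

# Scaling lives inside the pipeline, so each cross-validation fold is
# standardized using only its own training portion.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsregressor__n_neighbors": [1, 3, 5, 7, 9, 11]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
print(best_k, search.best_score_)
```

The same pattern could be applied to `x_train` and `y_train`, though the best k and the resulting scores would have to be measured rather than assumed.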

### Prediction

In [44]:
# Feature order must match the training columns: Age, Gender, Education Level, Years of Experience
model.predict([[40, 0, 2, 12]])



array([[151059.28436613]])

In [45]:
model.predict([[40, 1, 2, 12]])



array([[160000.]])

In [46]:
model.predict([[40, 1, 3, 12]])



array([[160000.]])

In [49]:
model.predict([[40, 1, 0, 5]])



array([[79000.]])
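
When a scikit-learn model is fitted on a DataFrame, passing a plain list to `predict` typically triggers a warning about missing feature names, and the positional order is easy to get wrong. A hedged alternative is to build the query as a one-row DataFrame with the same column names. The toy rows and salaries below are invented for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

feature_names = ["Age", "Gender", "Education Level", "Years of Experience"]

# Made-up training rows (not taken from the salary dataset).
toy_X = pd.DataFrame(
    [[25, 0, 0, 1], [30, 1, 1, 5], [40, 0, 2, 12], [50, 1, 2, 20]],
    columns=feature_names,
)
toy_y = pd.Series([40000.0, 70000.0, 150000.0, 200000.0], name="Salary")

toy_model = KNeighborsRegressor(n_neighbors=2).fit(toy_X, toy_y)

# The query carries the same column names as the training frame.
query = pd.DataFrame([[40, 0, 2, 12]], columns=feature_names)
prediction = toy_model.predict(query)
print(prediction)
```

Because `toy_y` is a Series rather than a one-column DataFrame, `predict` returns a one-dimensional array here instead of the nested arrays shown in the cells above.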

### Evaluation of Model

In [50]:
# For regressors, score() returns the coefficient of determination (R^2).
print('Training Score of Model =', model.score(x_train, y_train))
print('Testing Score of Model = ', model.score(x_test, y_test))

Training Score of Model = 0.9100761289726689
Testing Score of Model =  0.901497339830531
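
The scores above are R² values. R² alone does not say how large the errors are in salary units, so it is often reported alongside MAE and RMSE. The sketch below computes all three by hand on a few made-up numbers; `regression_metrics` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R^2, mean absolute error, and root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = float((residuals ** 2).sum())                 # unexplained variation
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())  # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": float(np.abs(residuals).mean()),
        "rmse": float(np.sqrt((residuals ** 2).mean())),
    }


# Made-up values for illustration only.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 260.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Applied to this notebook, the inputs would be `y_test` and `model.predict(x_test)`, flattened to one dimension.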
