**K-Nearest Neighbors**

We are going to build a KNN model that recommends jobs based on your experience and skills.

Key Concept: The user enters their skills and experience, and the KNN model returns the k nearest data points (jobs) matching that input. For each match, the company and LPA are then looked up through a hash map and returned to the user.

**Importing the necessary packages**

pandas - to load CSV files and to clean and preprocess the data before vectorization

In [3]:
import pandas as pd

numpy - to handle numerical values and build the feature arrays, since KNN works only with numerical data

In [5]:
import numpy as np

CountVectorizer - to convert the text (skills) to vectors

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

MinMaxScaler - to scale the experience values

In [7]:
from sklearn.preprocessing import MinMaxScaler

NearestNeighbors - the KNN model

In [8]:
from sklearn.neighbors import NearestNeighbors

Loading the dataset

In [9]:
df = pd.read_csv("job_dataset.csv")

In [10]:
df

Unnamed: 0,company_name,lpa,skills,experience_needed
0,Netflix,28.3,"Python, C++, SQL",2
1,Microsoft,12.0,"Docker, SQL",3
2,Amazon,34.2,"HTML, Django, Node.js",4
3,Meta,33.3,"Node.js, Deep Learning, Python, Java",3
4,Microsoft,35.5,"AWS, Java, Deep Learning, Machine Learning, C++",1
...,...,...,...,...
245,Microsoft,11.5,"Python, AWS, SQL, CSS, C++",8
246,Tesla,29.1,"Docker, Deep Learning, JavaScript",3
247,Accenture,23.9,"Kubernetes, React, AWS",4
248,Microsoft,12.8,"React, HTML, Machine Learning, Java",4


Creating a hashed column for company names, used for the final output

In [11]:
df["company_hash"]=df["company_name"].apply(lambda x:hash(x)%100000)
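One caveat with the cell above: Python salts the built-in `hash()` for strings per interpreter process (unless `PYTHONHASHSEED` is set), so the `company_hash` values change between runs, and taking `% 100000` allows two different companies to land on the same value. A minimal sketch of a run-stable alternative, assuming a hypothetical helper `stable_company_hash` built on `hashlib` (it is not part of the original notebook):

```python
import hashlib

def stable_company_hash(name: str, buckets: int = 100000) -> int:
    """Hypothetical helper: map a company name to a bucket that is the same on every run."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# Illustrative company names taken from the table above
companies = ["Netflix", "Microsoft", "Amazon"]
hashes = {name: stable_company_hash(name) for name in companies}
print(hashes)
```

Collisions are still possible with only 100000 buckets, so a reverse map keyed on this value should be checked for duplicates before it is trusted.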

Mapping each row index to its company hash and LPA

In [12]:
output_hash_map={
    idx:{
        "company_hash":row["company_hash"],
        "lpa":row["lpa"]
    }
    for idx,row in df.iterrows()
}

Reverse mapping to find the company name from its company_hash

In [15]:
company_reverse_map={
    row["company_hash"]:row["company_name"]
    for _,row in df.iterrows()
}

Data cleaning and preprocessing to make sure there is no invalid data in the skills column and to convert each comma-separated skills string into a Python list

In [17]:
def clean_skills(text):
    # Split on commas, trim whitespace, lowercase, and drop empty entries
    parts=text.split(",")
    parts=[p.strip().lower() for p in parts]
    parts=[p for p in parts if p]
    return parts

df["skills_clean"]=df["skills"].apply(clean_skills)
df[["skills_clean","skills"]].head()

Unnamed: 0,skills_clean,skills
0,"[python, c++, sql]","Python, C++, SQL"
1,"[docker, sql]","Docker, SQL"
2,"[html, django, node.js]","HTML, Django, Node.js"
3,"[node.js, deep learning, python, java]","Node.js, Deep Learning, Python, Java"
4,"[aws, java, deep learning, machine learning, c++]","AWS, Java, Deep Learning, Machine Learning, C++"


Since the KNN model cannot interpret text, we start by vectorizing the skills

In [23]:
# Convert list of skills into space-separated strings
df["skills_joined"] = df["skills_clean"].apply(lambda skills: " ".join(skills))

# Initialize the vectorizer
skill_vectorizer = CountVectorizer()

# Fit and transform to get the skill vectors
skill_vectors = skill_vectorizer.fit_transform(df["skills_joined"])

# Convert to array for KNN usage
skill_vectors = skill_vectors.toarray()

# Show the vocabulary (all skills)
print("Skill Vocabulary:", skill_vectorizer.get_feature_names_out()[:20])
print("Vector shape:", skill_vectors.shape)

Skill Vocabulary: ['aws' 'css' 'deep' 'django' 'docker' 'html' 'java' 'javascript' 'js'
 'kubernetes' 'learning' 'machine' 'node' 'python' 'react' 'sql']
Vector shape: (250, 16)


Normalization - the skill features are 0/1 counts, so large integer experience values would dominate the distance and keep the KNN model from finding neighbors based on skills. We therefore normalize the experience years to the range [0, 1].

In [26]:
exp_values=df[["experience_needed"]]
exp_scaler=MinMaxScaler()
exp_normalised=exp_scaler.fit_transform(exp_values)
df["experience_normalised"]=exp_normalised
df[["experience_needed","experience_normalised"]].head()

Unnamed: 0,experience_needed,experience_normalised
0,2,0.2
1,3,0.3
2,4,0.4
3,3,0.3
4,1,0.1
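`MinMaxScaler` rescales each value as `(x - min) / (max - min)`. The output above (2 becomes 0.2, 3 becomes 0.3) is consistent with experience ranging from 0 to 10 years, although that range is inferred from the table rather than stated by the dataset. A small worked sketch under that assumption, with `min_max_scale` as a hypothetical helper:

```python
def min_max_scale(values, lo, hi):
    """Rescale each value to [0, 1] using (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

# Assumed range: 0 to 10 years (inferred from the output, not confirmed)
experience = [2, 3, 4, 3, 1]
scaled = min_max_scale(experience, lo=0, hi=10)
print(scaled)
```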


Constructing the combined feature vectors (skills + normalized experience) used to build the KNN model

In [27]:
exp_vectors=df["experience_normalised"].values.reshape(-1,1)
knn_vectors=np.hstack((skill_vectors,exp_vectors))
print("Skill Vectors Shape: ",skill_vectors.shape)
print("Experience vector shape: ",exp_vectors.shape)
print("KNN Vectors Shape :",knn_vectors.shape)

Skill Vectors Shape:  (250, 16)
Experience vector shape:  (250, 1)
KNN Vectors Shape : (250, 17)


Let's fit the KNN model to find the 5 nearest neighbors, using cosine distance (1 - cosine similarity) as the metric

In [28]:
knn_model=NearestNeighbors(
    n_neighbors=5,
    metric="cosine"
)
knn_model.fit(knn_vectors)
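With `metric="cosine"`, the distance between two vectors is `1 - cosine similarity`, so it depends on the direction of the vectors rather than their length. A hand-computed sketch on toy vectors may make this concrete. The feature layout `[python, sql, java, experience_normalised]` and the helper `cosine_distance` are assumptions for illustration only:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy layout (assumed): [python, sql, java, experience_normalised]
user = [1, 1, 0, 0.3]
job_same_skills = [1, 1, 0, 0.2]   # same skills, slightly different experience
job_one_shared = [1, 0, 1, 0.3]    # only one skill in common

d_same = cosine_distance(user, job_same_skills)
d_partial = cosine_distance(user, job_one_shared)
print(d_same, d_partial)
```

The job that shares both skills ends up much closer than the job that shares only one, even though the second job matches the experience exactly.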

Now that the model is fitted, let's check its output for a sample user input

In [32]:
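user_skills=["python","sql","c++"]
user_experience=3

# Vectorize the skills with the already fitted vectorizer
user_skills_joined=" ".join(user_skills)
user_skill_vector=skill_vectorizer.transform([user_skills_joined])

# Scale the experience with the already fitted scaler; a DataFrame with the
# training column name avoids scikit-learn's feature-name warning
user_exp_norm=exp_scaler.transform(pd.DataFrame({"experience_needed":[user_experience]}))

# Combine into one query vector and look up the nearest jobs
user_vector=np.hstack((user_skill_vector.toarray(),user_exp_norm))
distances,indices=knn_model.kneighbors(user_vector)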
print("Recommended Jobs")
for idx in indices[0]:
    company_hash = output_hash_map[idx]["company_hash"]
    lpa = output_hash_map[idx]["lpa"]
    company_name = company_reverse_map[company_hash]

    print(f"Company: {company_name}")
    print(f"LPA: {lpa}")
    print(f"Skills: {df.loc[idx, 'skills_clean']}")
    print(f"Experience Needed: {df.loc[idx, 'experience_needed']} years")
    print("-" * 40)

Recommended Jobs
Company: Netflix
LPA: 28.3
Skills: ['python', 'c++', 'sql']
Experience Needed: 2 years
----------------------------------------
Company: Adobe
LPA: 39.4
Skills: ['java', 'python', 'sql']
Experience Needed: 0 years
----------------------------------------
Company: Oracle
LPA: 36.0
Skills: ['deep learning', 'sql', 'python', 'c++']
Experience Needed: 6 years
----------------------------------------
Company: Amazon
LPA: 23.2
Skills: ['python', 'javascript', 'css', 'sql']
Experience Needed: 5 years
----------------------------------------
Company: Netflix
LPA: 22.7
Skills: ['c++', 'sql']
Experience Needed: 4 years
----------------------------------------
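The query cell above computes `distances` but never shows them, which makes it hard to judge how close each recommendation really is. A minimal end-to-end sketch that wraps the lookup in a function and reports the cosine distance per result is given here. The toy `jobs` table, its company names, and the helper `recommend_jobs` are hypothetical and stand in for the notebook's dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the notebook's dataset (values are illustrative only)
jobs = pd.DataFrame({
    "company_name": ["A", "B", "C", "D"],
    "skills_joined": ["python sql", "docker sql", "html django", "python java sql"],
    "experience_needed": [2, 3, 4, 0],
})

# Same pipeline as the notebook: count vectors + min-max scaled experience
vectorizer = CountVectorizer()
skill_vectors = vectorizer.fit_transform(jobs["skills_joined"]).toarray()
scaler = MinMaxScaler()
exp_vectors = scaler.fit_transform(jobs[["experience_needed"]])
model = NearestNeighbors(n_neighbors=2, metric="cosine")
model.fit(np.hstack((skill_vectors, exp_vectors)))

def recommend_jobs(skills, experience, k=2):
    """Hypothetical helper: return the k nearest rows with their cosine distances."""
    skill_vec = vectorizer.transform([" ".join(skills)]).toarray()
    exp_vec = scaler.transform(pd.DataFrame({"experience_needed": [experience]}))
    query = np.hstack((skill_vec, exp_vec))
    distances, indices = model.kneighbors(query, n_neighbors=k)
    result = jobs.iloc[indices[0]].copy()
    result["distance"] = distances[0]
    return result

top = recommend_jobs(["python", "sql"], experience=3)
print(top[["company_name", "experience_needed", "distance"]])
```

A distance near 0 means the job vector points in almost the same direction as the query; larger values flag weaker matches that were returned only because `k` neighbors were requested.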


