## Recommender System for Matching HealthCare Professionals with Jobs Using Cosine Similarity

### Importing the relevant packages

In [1]:
import pandas as pd
import numpy as np
import random #for generating random numbers

### Generating synthetic data
In this notebook, we use synthetic data with 2000 professionals and 500 different jobs, generated with Python's random package.

In [2]:
# Defining the parameters to randomize the synthetic data
num_professionals = 2000
num_jobs = 500

In [3]:
# Defining the variables 
skills = ['nursing', 'physician', 'radiology', 'pharmacy', 'lab']
locations = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia']
certifications = ['BLS', 'ACLS', 'PALS', 'CPR', 'NRP']
education = ['Associate', 'Bachelor', 'Master', 'Doctorate']
work_preferences = ['part-time', 'full-time', 'day shift', 'night shift']
skill_levels = ['entry-level', 'intermediate', 'advanced']
max_experience_years = 30

In [4]:
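# Generating synthetic data for the healthcare professionals
# Note: the trailing comma on each assignment wraps the value in a 1-tuple,
# which is why the output below shows values such as (209,).
# Note: reassigning `skills` and `education` inside the loop replaces the
# candidate lists, so every row repeats the first value drawn.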
professionals_df = pd.DataFrame(columns=['id', 'skills', 'location', 'certification', 'education', 'skill_level', 'work_preference', 'experience' ])
for i in range(num_professionals):
    id = i + 1,
    skills = random.choice(skills),
    location = random.choice(locations),
    certification = random.sample(certifications, random.randint(1, len(certifications))),
    education = random.choice(education),
    work_preference = random.choice(work_preferences),
    experience = random.randint(1, max_experience_years),
    skill_level = random.choice(skill_levels),
    professionals_df.loc[i] = [id, skills, location, certification, education, skill_level, work_preference, experience]
professionals_df.sample(5)
    
    

| | id | skills | location | certification | education | skill_level | work_preference | experience |
|---|---|---|---|---|---|---|---|---|
| 208 | (209,) | (nursing,) | (New York,) | ([PALS, BLS],) | (Doctorate,) | (intermediate,) | (full-time,) | (20,) |
| 1890 | (1891,) | (nursing,) | (New York,) | ([PALS, CPR, BLS, ACLS],) | (Doctorate,) | (intermediate,) | (day shift,) | (25,) |
| 1143 | (1144,) | (nursing,) | (New York,) | ([PALS, CPR, ACLS],) | (Doctorate,) | (intermediate,) | (part-time,) | (7,) |
| 767 | (768,) | (nursing,) | (Philadelphia,) | ([PALS, NRP],) | (Doctorate,) | (intermediate,) | (part-time,) | (26,) |
| 1269 | (1270,) | (nursing,) | (New York,) | ([BLS],) | (Doctorate,) | (advanced,) | (full-time,) | (15,) |
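
The tuples and the constant `skills` and `education` values in this sample both trace back to the generation loop: each assignment ends with a comma, and the loop variables reuse the names of the candidate lists. A minimal sketch of one way to avoid both, assuming separate names for the option lists and a seeded generator (the `*_OPTIONS` constants, `make_professionals`, and `seed` are hypothetical names, not part of the notebook):

```python
import random

import pandas as pd

# Candidate values copied from the notebook. The *_OPTIONS names are
# hypothetical; they keep the loop from overwriting the lists.
SKILL_OPTIONS = ['nursing', 'physician', 'radiology', 'pharmacy', 'lab']
LOCATION_OPTIONS = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia']
CERTIFICATION_OPTIONS = ['BLS', 'ACLS', 'PALS', 'CPR', 'NRP']
EDUCATION_OPTIONS = ['Associate', 'Bachelor', 'Master', 'Doctorate']
WORK_PREFERENCE_OPTIONS = ['part-time', 'full-time', 'day shift', 'night shift']
SKILL_LEVEL_OPTIONS = ['entry-level', 'intermediate', 'advanced']
MAX_EXPERIENCE_YEARS = 30


def make_professionals(n, seed=0):
    """Build n synthetic professionals as plain values (no tuples)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rows = []
    for i in range(n):
        n_certs = rng.randint(1, len(CERTIFICATION_OPTIONS))
        rows.append({
            'id': i + 1,
            'skills': rng.choice(SKILL_OPTIONS),
            'location': rng.choice(LOCATION_OPTIONS),
            # Store the certifications as one space-separated string.
            'certification': ' '.join(rng.sample(CERTIFICATION_OPTIONS, n_certs)),
            'education': rng.choice(EDUCATION_OPTIONS),
            'skill_level': rng.choice(SKILL_LEVEL_OPTIONS),
            'work_preference': rng.choice(WORK_PREFERENCE_OPTIONS),
            'experience': rng.randint(1, MAX_EXPERIENCE_YEARS),
        })
    # Building the frame once from a list of dicts avoids row-by-row .loc writes.
    return pd.DataFrame(rows)


sketch_df = make_professionals(2000)
```

With this variant the columns already hold plain strings and integers, so no cleaning step or dtype cast would be needed. The rest of the notebook keeps the original data, in which every row has the same skill and education level.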


Every value is wrapped in a tuple, so the output is cluttered with commas and parentheses. Let us create a function to remove them.

In [5]:
def clean_tuple(t):
    return str(t).replace("(", "").replace(",", "").replace(")", "").replace("'", "").replace("[", "").replace("]", "")
professionals_df = professionals_df.applymap(clean_tuple)
professionals_df.sample(5)

| | id | skills | location | certification | education | skill_level | work_preference | experience |
|---|---|---|---|---|---|---|---|---|
| 1970 | 1971 | nursing | Chicago | NRP PALS | Doctorate | intermediate | part-time | 7 |
| 318 | 319 | nursing | New York | CPR NRP PALS ACLS | Doctorate | entry-level | day shift | 3 |
| 1849 | 1850 | nursing | Los Angeles | CPR NRP PALS BLS ACLS | Doctorate | intermediate | part-time | 18 |
| 1406 | 1407 | nursing | Houston | NRP CPR BLS | Doctorate | intermediate | day shift | 10 |
| 798 | 799 | nursing | Los Angeles | ACLS | Doctorate | intermediate | day shift | 18 |
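
To make the cleaning step concrete, here is a small sketch of what the character stripping does to individual cells. It uses a three-column demo frame (`demo` is a hypothetical name) and picks `DataFrame.map` when it is available, since `applymap` is deprecated in recent pandas releases; this assumes pandas is installed.

```python
import pandas as pd


def clean_tuple(t):
    # Same idea as the notebook's clean_tuple: stringify the cell and
    # strip the tuple/list punctuation and quote characters.
    text = str(t)
    for ch in "(),'[]":
        text = text.replace(ch, "")
    return text


# Hypothetical demo cells shaped like the raw professionals data.
demo = pd.DataFrame({
    'id': [(209,)],
    'location': [('New York',)],
    'certification': [(['PALS', 'BLS'],)],
})

# DataFrame.map replaces applymap in newer pandas; fall back if it is missing.
apply_elementwise = demo.map if hasattr(demo, 'map') else demo.applymap
cleaned = apply_elementwise(clean_tuple)
```

Because the function works on the string form, every cleaned cell is a string (including `id`), and any legitimate comma, apostrophe, or bracket inside a value would be stripped as well.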


The dataset looks cleaner, so let us do the same for the jobs dataframe.

In [22]:
# Generating synthetic data for the jobs with the same loop as above.
# By this point `skills` and `education` hold the 1-tuples left over from the professionals loop.
jobs_df = pd.DataFrame(columns=['id', 'skills', 'location', 'certification_required', 'education_required', 'skill_level_required', 'work_preference', 'experience_required' ])
for i in range(num_jobs):
    id = i + 1,
    skills = random.choice(skills),
    location = random.choice(locations),
    certification = random.sample(certifications, random.randint(1, len(certifications))),
    education = random.choice(education),
    work_preference = random.choice(work_preferences),
    experience = random.randint(1, max_experience_years),
    skill_level = random.choice(skill_levels),
    jobs_df.loc[i] = [id, skills, location, certification, education, skill_level, work_preference, experience]
    
jobs_df = jobs_df.applymap(clean_tuple)
jobs_df.sample(5)
    

| | id | skills | location | certification_required | education_required | skill_level_required | work_preference | experience_required |
|---|---|---|---|---|---|---|---|---|
| 335 | 336 | nursing | Los Angeles | ACLS CPR BLS NRP PALS | Doctorate | intermediate | day shift | 10 |
| 185 | 186 | nursing | Los Angeles | ACLS NRP PALS CPR BLS | Doctorate | advanced | day shift | 1 |
| 373 | 374 | nursing | Chicago | NRP ACLS BLS | Doctorate | advanced | part-time | 6 |
| 476 | 477 | nursing | Houston | CPR NRP | Doctorate | advanced | part-time | 15 |
| 166 | 167 | nursing | New York | PALS BLS | Doctorate | intermediate | full-time | 25 |


In [23]:
jobs_df.shape, professionals_df.shape

((500, 8), (2000, 8))

### Exploratory Data Analysis
Now that we have two datasets, let us explore them.

In [24]:
professionals_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               2000 non-null   int64 
 1   skills           2000 non-null   object
 2   location         2000 non-null   object
 3   certification    2000 non-null   object
 4   education        2000 non-null   object
 5   skill_level      2000 non-null   object
 6   work_preference  2000 non-null   object
 7   experience       2000 non-null   int64 
dtypes: int64(2), object(6)
memory usage: 140.6+ KB


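We need the id and the experience columns to be integers. The output above already reports int64 for both, most likely because the cells were run out of order: the cast below executed as cell 9, before info() was re-run as cell 24.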


In [9]:
professionals_df.id = professionals_df.id.astype('int64')
professionals_df.experience = professionals_df.experience.astype('int64')
professionals_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               2000 non-null   int64 
 1   skills           2000 non-null   object
 2   location         2000 non-null   object
 3   certification    2000 non-null   object
 4   education        2000 non-null   object
 5   skill_level      2000 non-null   object
 6   work_preference  2000 non-null   object
 7   experience       2000 non-null   int64 
dtypes: int64(2), object(6)
memory usage: 140.6+ KB


Let us now investigate the data types of the jobs variables

In [25]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      500 non-null    object
 1   skills                  500 non-null    object
 2   location                500 non-null    object
 3   certification_required  500 non-null    object
 4   education_required      500 non-null    object
 5   skill_level_required    500 non-null    object
 6   work_preference         500 non-null    object
 7   experience_required     500 non-null    object
dtypes: object(8)
memory usage: 35.2+ KB


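Every column is stored as object because clean_tuple returns strings. As with the professionals, we need the id and the experience_required columns to be integers.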

In [27]:
jobs_df.id = jobs_df.id.astype('int64')
jobs_df.experience_required = jobs_df.experience_required.astype('int64')
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      500 non-null    int64 
 1   skills                  500 non-null    object
 2   location                500 non-null    object
 3   certification_required  500 non-null    object
 4   education_required      500 non-null    object
 5   skill_level_required    500 non-null    object
 6   work_preference         500 non-null    object
 7   experience_required     500 non-null    int64 
dtypes: int64(2), object(6)
memory usage: 35.2+ KB
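
The two casts can also be written as a single `astype` call with a column-to-dtype mapping. A minimal sketch on a tiny demo frame (`demo_jobs` is a hypothetical stand-in for `jobs_df`), assuming every value in the cast columns is a digit string:

```python
import pandas as pd

# Hypothetical demo frame: all values are strings, as they are after cleaning.
demo_jobs = pd.DataFrame({
    'id': ['1', '2'],
    'skills': ['nursing', 'lab'],
    'experience_required': ['10', '3'],
})

# Cast only the listed columns; the others keep their dtype.
demo_jobs = demo_jobs.astype({'id': 'int64', 'experience_required': 'int64'})
```

If a column might contain non-numeric strings, `astype` raises an error, which is usually preferable to silently carrying bad values into the similarity step.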


### Feature Engineering
We will match healthcare professionals to job opportunities by scoring how similar each professional's skills and experience are to a job's requirements.
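
The title names cosine similarity as the matching score. For two numeric feature vectors a and b it is the dot product divided by the product of their lengths, so it is 1 for vectors pointing in the same direction and 0 for orthogonal ones. A minimal NumPy sketch follows; the feature layout in the example vectors is an assumption made for illustration, not the notebook's actual encoding:

```python
import numpy as np


def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # convention chosen here; the score is undefined for zero vectors
    return float(np.dot(a, b) / denom)


# Hypothetical layout: [has BLS, has ACLS, has PALS, years of experience / 30]
professional = [1, 0, 1, 0.5]
job = [1, 1, 0, 0.2]
score = cosine_similarity(professional, job)
```

Because the score depends only on direction, columns on very different scales (for example raw years of experience next to 0/1 flags) would dominate it unless they are rescaled first.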

In [28]:
jobs_df.sample(4)

| | id | skills | location | certification_required | education_required | skill_level_required | work_preference | experience_required |
|---|---|---|---|---|---|---|---|---|
| 182 | 183 | nursing | New York | ACLS CPR | Doctorate | entry-level | day shift | 19 |
| 191 | 192 | nursing | Chicago | PALS | Doctorate | advanced | part-time | 30 |
| 444 | 445 | nursing | Philadelphia | PALS | Doctorate | advanced | full-time | 28 |
| 369 | 370 | nursing | Philadelphia | NRP | Doctorate | entry-level | night shift | 29 |


In [29]:
professionals_df.sample(4)

| | id | skills | location | certification | education | skill_level | work_preference | experience |
|---|---|---|---|---|---|---|---|---|
| 1008 | 1009 | nursing | Los Angeles | NRP BLS CPR | Doctorate | intermediate | part-time | 10 |
| 425 | 426 | nursing | Philadelphia | NRP CPR ACLS PALS BLS | Doctorate | advanced | night shift | 24 |
| 1568 | 1569 | nursing | Los Angeles | NRP ACLS CPR | Doctorate | entry-level | full-time | 13 |
| 1074 | 1075 | nursing | New York | PALS | Doctorate | intermediate | full-time | 13 |


We will join the two dataframes on the skills column, so that each professional is paired with every job that requires the same skill.

In [30]:
merged_df = pd.merge(professionals_df, jobs_df, on='skills', how='inner', suffixes=('_pro', '_job'))
merged_df.sample(4)

| | id_pro | skills | location_pro | certification | education | skill_level | work_preference_pro | experience | id_job | location_job | certification_required | education_required | skill_level_required | work_preference_job | experience_required |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 367631 | 736 | nursing | Los Angeles | PALS | Doctorate | intermediate | part-time | 24 | 132 | Philadelphia | PALS CPR ACLS | Doctorate | entry-level | part-time | 2 |
| 623763 | 1248 | nursing | New York | BLS NRP CPR ACLS PALS | Doctorate | advanced | day shift | 11 | 264 | Los Angeles | BLS ACLS PALS | Doctorate | entry-level | day shift | 20 |
| 957492 | 1915 | nursing | Houston | PALS CPR ACLS BLS | Doctorate | intermediate | day shift | 17 | 493 | New York | CPR NRP BLS | Doctorate | advanced | full-time | 24 |
| 977894 | 1956 | nursing | New York | NRP PALS CPR BLS | Doctorate | intermediate | day shift | 8 | 395 | Philadelphia | ACLS NRP PALS | Doctorate | entry-level | part-time | 26 |


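The merged dataframe has 15 columns and one million rows. That is exactly 2000 × 500: every professional and every job in this synthetic data has the skill nursing, so the inner join pairs each professional with each job.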

In [31]:
merged_df.shape, merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 15 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   id_pro                  1000000 non-null  int64 
 1   skills                  1000000 non-null  object
 2   location_pro            1000000 non-null  object
 3   certification           1000000 non-null  object
 4   education               1000000 non-null  object
 5   skill_level             1000000 non-null  object
 6   work_preference_pro     1000000 non-null  object
 7   experience              1000000 non-null  int64 
 8   id_job                  1000000 non-null  int64 
 9   location_job            1000000 non-null  object
 10  certification_required  1000000 non-null  object
 11  education_required      1000000 non-null  object
 12  skill_level_required    1000000 non-null  object
 13  work_preference_job     1000000 non-null  object
 14  experience_required

((1000000, 15), None)
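
The row count of an inner join on skills is the sum, over each skill, of (professionals with that skill) × (jobs with that skill). A small sketch with hypothetical demo frames shows how this becomes a full Cartesian product when every row shares one skill, and how it shrinks when the skills vary:

```python
import pandas as pd

# Every row has the same skill: 3 professionals x 2 jobs = 6 pairs.
pros = pd.DataFrame({'id': [1, 2, 3], 'skills': ['nursing'] * 3})
jobs = pd.DataFrame({'id': [10, 20], 'skills': ['nursing'] * 2})
same_skill = pd.merge(pros, jobs, on='skills', how='inner',
                      suffixes=('_pro', '_job'))

# Varied skills: only the 2 nursing professionals match the 1 nursing job.
pros_varied = pd.DataFrame({'id': [1, 2, 3],
                            'skills': ['nursing', 'lab', 'nursing']})
jobs_varied = pd.DataFrame({'id': [10, 20],
                            'skills': ['nursing', 'pharmacy']})
varied = pd.merge(pros_varied, jobs_varied, on='skills', how='inner',
                  suffixes=('_pro', '_job'))
```

With varied skills the merge acts as a coarse pre-filter before any similarity scoring. With a single shared skill it filters nothing, so all 2000 × 500 pairs are kept.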

### User-item Matrix
We will create a user-item matrix. Before that, we have to encode some of our variables, since the user-item matrix is essentially a matrix of numerical values.
In our case, we will perform one-hot encoding on the skills, location_pro, certification, education, skill_level, and work_preference_pro columns. Note that the next cell imports scikit-learn's LabelEncoder, which maps each category to an integer label rather than to one-hot columns.
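
As an illustration of what one-hot encoding produces, here is a minimal sketch using pandas' `get_dummies` in place of a scikit-learn encoder (a swapped-in technique chosen to keep the example dependency-light). The demo frame is hypothetical; the space-separated certification column is split so that each certification gets its own indicator column:

```python
import pandas as pd

# Hypothetical demo rows in the cleaned format used by the notebook.
demo = pd.DataFrame({
    'location_pro': ['New York', 'Chicago', 'New York'],
    'education': ['Bachelor', 'Master', 'Bachelor'],
    'certification': ['PALS BLS', 'ACLS', 'BLS'],
})

# Single-valued columns: one 0/1 column per category.
single = pd.get_dummies(demo[['location_pro', 'education']], dtype=int)

# Multi-valued column: split on spaces, one indicator per certification.
certs = (demo['certification']
         .str.get_dummies(sep=' ')
         .add_prefix('certification_'))

encoded = pd.concat([single, certs], axis=1)
```

Each row of `encoded` is a numeric vector, which is the form a cosine similarity computation needs. Integer labels from a label encoder would instead impose an arbitrary ordering on the categories.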

In [32]:
from sklearn.preprocessing import LabelEncoder