##### The Following is derived from a case study I completed on recommendations. For privacy and confidentiality, I've altered all of the IDS, locations, and job types in the dataset.

# K-Nearest Neighbors Weighting

### In February 2021, I completed a case with a large corporation whose business model relies on recommendation algorithms. The case asked for predictions based on the data offered. Here, I improve upon that case by experimenting with weighting variables in a KNN algorithm in order to give more importance to certain factors.

### Background: The dataset includes nearly 4000 users who have applied to a job. We have information on the user's career level, their location, their education, and we have information on the job they applied to: its location, education, and type.

### Our task is to predict for each user what three jobs they are likely to apply to next. Let's start.

In [1]:
#imports
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [2]:
#Let's look at the data

df = pd.read_csv('case_data.csv')
df.head()

Unnamed: 0,USERID,JOBID,UserState,CAREERLEVEL,UserEd,JobState,JobEd,JobType
0,003650a8-ae65-4d0d-9211-f9b7ae2ba87f,1c91a60f-8c98-4f4e-a8d2-b2b907ffa959,Illinois,Manager (Manager/Supervisor of Staff),Some College Coursework Completed,Illinois,High School or equivalent,CONSTRUCTION
1,be6510b5-6a32-4123-bcb0-1691423ab4d3,1c91a60f-8c98-4f4e-a8d2-b2b907ffa959,Illinois,Student (High School),Unspecified,Illinois,High School or equivalent,CONSTRUCTION
2,1cce0c54-be4c-471f-aa2a-468e55ca5924,1c91a60f-8c98-4f4e-a8d2-b2b907ffa959,Illinois,Manager (Manager/Supervisor of Staff),Some College Coursework Completed,Illinois,High School or equivalent,CONSTRUCTION
3,73bcb589-68c0-47f0-bf7b-c8ac5ef7353a,c82d255a-eae5-46e7-b727-7ba59737f184,Montana,Experienced (Non-Manager),Bachelors Degree,Montana,Bachelors Degree,MEDICAL
4,79e36f7f-d84a-4876-8f30-e97576c9b5bf,9eb0aaf0-8bc3-4e11-80df-53d78020c439,Illinois,Experienced (Non-Manager),Bachelors Degree,Massachusetts,Bachelors Degree,FINANCE


#### Remove any whitespace from columns

In [3]:
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

## EDA

#### Let's look at some of the locations

In [5]:
df.UserState.sample(20)

2138          Bristol
1264          Alabama
795        California
2676             Ohio
1600         New York
3585          Arizona
3057         Colorado
1224            Texas
3243    Massachusetts
1456         New York
2577         Virginia
3062    Massachusetts
1109            Texas
1691         Colorado
3486          Florida
2325          Florida
2558         New York
703             Texas
3520          Florida
1083            Texas
Name: UserState, dtype: object

#### Notice Bristol--It looks like there are foreign locations, so let's find out which ones

In [6]:
state_names = ["Alaska", "Alabama", "Arkansas", "American Samoa", "Arizona", "California", "Colorado", "Connecticut", "District ", "of Columbia", "Delaware","District of Columbia", "Florida", "Georgia", "Guam", "Hawaii", "Iowa", "Idaho", "Illinois", "Indiana", "Kansas", "Kentucky", "Louisiana", "Massachusetts", "Maryland", "Maine", "Michigan", "Minnesota", "Missouri", "Mississippi", "Montana", "North Carolina", "North Dakota", "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "Nevada", "New York", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Virginia", "Virgin Islands", "Vermont", "Washington", "Wisconsin", "West Virginia", "Wyoming"]
#Create df of foreign locations
foreign = df[df.UserState.apply(lambda x: False if x in state_names else True)]
print(foreign.UserState.value_counts())

Madrid        48
London        39
Manchester    37
Bristol       21
Birmingham    20
Paris         17
Prague        10
Wales          6
Ontario        6
Dublin         5
Alberta        3
Berlin         2
Rome           2
Sao Paulo      2
Tokyo          2
Scotland       1
Z�rich         1
Quebec         1
Brussels       1
Kyoto          1
Frankfurt      1
Name: UserState, dtype: int64


#### Where are those in foreign countries applying?

In [7]:
#lets look at the locations of the jobs of users applying in Madrid
print(foreign[foreign.UserState=='Madrid']['JobState'].value_counts())

Madrid     45
Bristol     3
Name: JobState, dtype: int64


In [9]:
print(foreign[foreign.UserState=='London']['JobState'].value_counts())

London        31
Prague         2
Bristol        2
Scotland       2
Birmingham     1
Name: JobState, dtype: int64


#### Looks like most users in other countries are applying to jobs in their city and country, but there are exceptions

In [10]:
print('UNIQUE STATES: ',sorted(list(df.UserState.unique()), key=str.lower))

UNIQUE STATES:  ['Alabama', 'Alberta', 'Arizona', 'Arkansas', 'Berlin', 'Birmingham', 'Bristol', 'Brussels', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Dublin', 'Florida', 'Frankfurt', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Kyoto', 'London', 'Louisiana', 'Madrid', 'Maine', 'Manchester', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Ontario', 'Oregon', 'Paris', 'Pennsylvania', 'Prague', 'Puerto Rico', 'Quebec', 'Rhode Island', 'Rome', 'Sao Paulo', 'Scotland', 'South Carolina', 'Tennessee', 'Texas', 'Tokyo', 'Utah', 'Vermont', 'Virgin Islands', 'Virginia', 'Wales', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming', 'Z�rich']


#### Let's replace that value in Zurich (It's been imported as ''Z�rich")

In [11]:
df['UserState'] = df['UserState'].replace('Z�rich','Zurich')
print(sorted(list(df.UserState.unique()), key=str.lower, reverse=True)[0])

Zurich


## Regions

### Let's generalize all locations and create regions so that if there aren't matching jobs in a state, we can look at nearby places

### If I'm applying to a Job in Boston, I'm more likely to be interested in jobs in New York City than in Los Angeles

In [17]:
us_state_abbrev = {'Alabama': 'AL','Alaska': 'AK','American Samoa': 'AS','Arizona': 'AZ','Arkansas': 'AR','California': 'CA',
    'Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE','District of Columbia': 'DC','Florida': 'FL','Georgia': 'GA',
    'Guam': 'GU','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL','Indiana': 'IN','Iowa': 'IA','Kansas': 'KS','Kentucky': 'KY',
    'Louisiana': 'LA','Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN',
    'Mississippi': 'MS','Missouri': 'MO','Montana': 'MT','Nebraska': 'NE','Nevada': 'NV','New Hampshire': 'NH',
    'New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC','North Dakota': 'ND','Northern Mariana Islands':'MP',
    'Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA','Puerto Rico': 'PR','Rhode Island': 'RI',
    'South Carolina': 'SC','South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX','Utah': 'UT','Vermont': 'VT',
    'Virgin Islands': 'VI','Virginia': 'VA','Washington': 'WA','West Virginia': 'WV','Wisconsin': 'WI','Wyoming': 'WY'}

state_regions = {'AK': 'O','AL': 'S','AR': 'S','AS': 'O','AZ': 'W','CA': 'W','CO': 'W','CT': 'N','DC': 'N','DE': 'N',
        'FL': 'S','GA': 'S','GU': 'O','HI': 'O','IA': 'M','ID': 'W','IL': 'M','IN': 'M','KS': 'M','KY': 'S','LA': 'S',
        'MA': 'N','MD': 'N','ME': 'N','MI': 'W','MN': 'M','MO': 'M','MP': 'O','MS': 'S','MT': 'W','NA': 'O','NC': 'S',
        'ND': 'M','NE': 'W','NH': 'N','NJ': 'N','NM': 'W','NV': 'W','NY': 'N','OH': 'M','OK': 'S','OR': 'W','PA': 'N',
        'PR': 'O','RI': 'N','SC': 'S','SD': 'M','TN': 'S','TX': 'S','UT': 'W','VA': 'S','VI': 'O','VT': 'N','WA': 'W',
        'WI': 'M','WV': 'S','WY': 'W'}

foreign = {'Madrid':'SP','London':'UK','Manchester':'UK','Bristol':'UK','Birmingham':'UK','Paris':'FR',
           'Prague':'CR','Ontario':'CAN', 'Wales':'UK','Dublin':'IR','Alberta':'CAN','Rome':'IT',
           'Tokyo':'JP','Berlin':'GER','Sao Paulo':'BRA','Zurich':'SW','Brussels':'BEL','Quebec':'CAN',         
           'Scotland':'UK','Kyoto':'JP','Frankfurt':'GER','Normandy':'FR','Galway':'IR','Vienna':'AUS','Amsterdam':'NETH',
          'Toronto':'CAN','Dresden':'GER'}


#### Let's create a variable for country

In [18]:
df['job_country'] = df.JobState.apply(lambda x: 'USA' if x in list(us_state_abbrev.keys()) else (foreign.get(x,np.NaN)))
df['user_country'] = df.UserState.apply(lambda x: 'USA' if x in list(us_state_abbrev.keys()) else (foreign.get(x,np.NaN)))

#### And one for region

In [19]:
def get_region(row, column):
        if row[column] in list(us_state_abbrev.keys()):
            return state_regions[us_state_abbrev[row[column]]]
        elif row[column] in list(foreign.keys()):
            return foreign[row[column]]
        else:
            return np.NaN

df['job_region'] = df.apply(lambda x: get_region(x, 'JobState'), axis=1)
df['user_region'] = df.apply(lambda x: get_region(x, 'UserState'), axis=1)

#### Note: The region variable won't add much value to positions outside of the U.S. for multiple reasons: for one, there aren't enough observations for region to pull in similar jobs within a region. Second, my knowledge of European regional geography leaves much to be desired, so with my limited time, researching all of the localities and dividing the nations seemed like an inefficient use of my time.

#### As a substitute, I added the country variable: rather than recommending a position in Italy to a user who applied to a position in the U.K., this variable led the algorithm to recommend other positions in the U.K. In addition, within the U.S., this variable added weight to positions within the country even if they weren’t in the same region.

In [20]:
print('Number of entries without a region: ', len(df[df.job_region.isnull()]))

Number of entries without a region:  5


In [21]:
df[df.job_region.isnull()]

Unnamed: 0,USERID,JOBID,UserState,CAREERLEVEL,UserEd,JobState,JobEd,JobType,job_country,user_country,job_region,user_region
38,e89cd6fa-ee9e-4473-b7d9-065fd1e09ae9,0efcbc05-a30e-4575-8c7d-972a8c0d1776,Pennsylvania,Experienced (Non-Manager),Vocational - High School,,High School or equivalent,CONSTRUCTION,,USA,,N
1736,2981a7c0-a9cf-46f6-a609-ffac2207244b,4208d49a-7b5a-4713-894d-8ad23f83428a,West Virginia,Experienced (Non-Manager),Unspecified,,High School or equivalent,CONSTRUCTION,,USA,,S
1940,1afd76ab-942e-4388-9dcb-32b5bb31c09d,af6c4a58-f5cc-4fa0-b743-8d7cbf62cf30,Dublin,Entry Level,Vocational,,Bachelors Degree,FINANCE,,IR,,IR
2817,7cf17b97-2d96-4f1d-b8a9-261980f72d6f,fcf7cc45-d398-4f3b-88ba-dc51ad8cdb13,London,Experienced (Non-Manager),Bachelors Degree,,Bachelors Degree,FINANCE,,UK,,UK
3162,39c6e386-f43f-45b6-9298-6650ea182bba,df7d2e52-0326-462e-92f2-1e14bfcfc5f1,California,Manager (Manager/Supervisor of Staff),Bachelors Degree,,Bachelors Degree,FINANCE,,USA,,W


In [22]:
df.sample(10)

Unnamed: 0,USERID,JOBID,UserState,CAREERLEVEL,UserEd,JobState,JobEd,JobType,job_country,user_country,job_region,user_region
1700,750f75d8-593d-437d-99ca-06bf257ef4d3,d5280b53-2a33-482a-b097-c4e3f4c09f30,Texas,Manager (Manager/Supervisor of Staff),Unspecified,Texas,High School or equivalent,CONSTRUCTION,USA,USA,S,S
1981,6e74179d-48e7-4f10-8a6e-c132555069f5,615dcf4b-e3ad-4641-a20a-f1c265fe8291,Indiana,Experienced (Non-Manager),Unspecified,Indiana,High School or equivalent,CONSTRUCTION,USA,USA,M,M
2031,5fd89aca-d67f-4c26-8cdd-2ef994e6e518,332e5aed-fc6f-422a-bf1a-515657e5c777,New York,Manager (Manager/Supervisor of Staff),Bachelors Degree,New York,High School or equivalent,CONSTRUCTION,USA,USA,N,N
3486,ff2fbb96-4e60-4490-ab98-7a960714c0fe,8c66ef11-4947-4c00-8b0a-b5d7a7dc65ee,Florida,Student (undergraduate/graduate),Unspecified,Florida,High School or equivalent,CONSTRUCTION,USA,USA,S,S
1863,1d337763-9ec7-46a8-8fbd-115075b597c5,4cf6e5c4-129f-4056-88f7-cf2d6042bd96,Washington,Student (undergraduate/graduate),Bachelors Degree,New York,Bachelors Degree,FINANCE,USA,USA,N,W
1642,76e6a621-2a64-4462-9d16-3d7a87f91273,f5fe742e-8bc9-4b93-8f3a-d624934e2855,Kentucky,Experienced (Non-Manager),Unspecified,Texas,Bachelors Degree,FINANCE,USA,USA,S,S
1289,0fbbfb25-57ef-440f-848b-4401d68310bb,b93b1658-b73d-4f85-ab9e-ecb74f2b4c04,New Jersey,Student (undergraduate/graduate),High School or equivalent,New York,High School or equivalent,FINANCE,USA,USA,N,N
3484,74f6fe3a-951d-4e81-847f-c78b910b90d7,74b8f864-5953-4499-b7ca-92019250f100,Tennessee,Experienced (Non-Manager),Unspecified,Tennessee,High School or equivalent,MEDICAL,USA,USA,S,S
1393,49da6afb-a252-4ec1-b41e-67bc897309e4,a886101e-2c1b-44ba-9679-4c752896d369,West Virginia,Experienced (Non-Manager),Some College Coursework Completed,Ohio,Professional,CONSTRUCTION,USA,USA,M,S
2765,4949f335-b861-4205-bd62-7f70c238b86f,1b96bcae-671e-4355-963d-e9affc36de26,Texas,Manager (Manager/Supervisor of Staff),Unspecified,Texas,Bachelors Degree,FINANCE,USA,USA,S,S


### Location exploration

#### Are applicants applying to jobs in different countries?

In [23]:
print('Count of users applying outside their country: ', len(df[df.apply(lambda x: True if x['job_country'] != x['user_country'] else False, axis=1)]))

Count of users applying outside their country:  41


#### Looks like only a few, but lets see if there any patterns among them

In [24]:
df[df.apply(lambda x: True if x['job_country'] != x['user_country'] else False, axis=1)]

Unnamed: 0,USERID,JOBID,UserState,CAREERLEVEL,UserEd,JobState,JobEd,JobType,job_country,user_country,job_region,user_region
38,e89cd6fa-ee9e-4473-b7d9-065fd1e09ae9,0efcbc05-a30e-4575-8c7d-972a8c0d1776,Pennsylvania,Experienced (Non-Manager),Vocational - High School,,High School or equivalent,CONSTRUCTION,,USA,,N
200,2581517a-d08a-4da0-8942-aeb495179a94,8db5cf9f-c6e7-45f4-930c-c8578774f6bb,Prague,Student (undergraduate/graduate),Unspecified,London,Certification,CONSTRUCTION,UK,CR,UK,CR
268,5d85427e-1512-4727-8147-ff4ffbc8ec17,8c786fdb-8ed5-4c7d-bfd3-549bc1627f96,Dublin,Experienced (Non-Manager),Unspecified,New York,Bachelors Degree,FINANCE,USA,IR,N,IR
645,5d85427e-1512-4727-8147-ff4ffbc8ec17,9f94bdb1-13c4-4c2e-bef6-a4c36b626cf3,Dublin,Experienced (Non-Manager),Unspecified,New York,Bachelors Degree,FINANCE,USA,IR,N,IR
894,60978e79-4d23-40f0-bd0c-6a7b420adb60,88adc908-af65-470e-b375-b66b059b5e1a,Prague,Student (undergraduate/graduate),Unspecified,London,High School or equivalent,CONSTRUCTION,UK,CR,UK,CR
1167,b57068ab-99b2-4b19-8244-f7af03c69430,af7422e3-9ec7-4f14-813f-05f803872d49,Tokyo,Experienced (Non-Manager),Masters Degree,New York,Some College Coursework Completed,FINANCE,USA,JP,N,JP
1595,c6fc0b81-83e0-4e84-97a1-e454490bcfa2,c10a827c-e48d-4ff1-ad17-f704073f1cc1,London,Experienced (Non-Manager),Unspecified,Prague,Some High School Coursework,CONSTRUCTION,CR,UK,CR,UK
1649,59465faf-2c21-4dc8-9457-65349ee33fc9,0db15dbd-0f88-4a0a-a74e-2081a7b00157,Sao Paulo,Experienced (Non-Manager),Bachelors Degree,Massachusetts,Bachelors Degree,FINANCE,USA,BRA,N,BRA
1731,df943a4d-1d8c-4f8e-ac8e-c46b86499d60,d18466d1-9877-4269-a781-b875a61410e2,Bristol,Entry Level,Unspecified,Madrid,Some High School Coursework,CONSTRUCTION,SP,UK,SP,UK
1736,2981a7c0-a9cf-46f6-a609-ffac2207244b,4208d49a-7b5a-4713-894d-8ad23f83428a,West Virginia,Experienced (Non-Manager),Unspecified,,High School or equivalent,CONSTRUCTION,,USA,,S


#### Looks like the majority of the jobs are in finance: let's see if finance jobs are more likely to apply farther away

In [25]:
print('Percent of FINANCE jobs applying to same region: ', round(len(df[df.apply(lambda x: True if (x['JobType']=='FINANCE') & 
    (x['job_region']==x['user_region']) else False, axis=1)])/len(df[df['JobType']=='FINANCE']),2),'\n')

print('Percent of MEDICAL jobs applying to same region: ', round(len(df[df.apply(lambda x: True if (x['JobType']=='MEDICAL') & 
    (x['job_region']==x['user_region']) else False, axis=1)])/len(df[df['JobType']=='MEDICAL']),2),'\n')

print('Percent of CONSTRUCTION jobs applying to same region: ', round(len(df[df.apply(lambda x: True if (x['JobType']=='CONSTRUCTION') & 
    (x['job_region']==x['user_region']) else False, axis=1)])/len(df[df['JobType']=='CONSTRUCTION']),2),'\n')

Percent of FINANCE jobs applying to same region:  0.82 

Percent of MEDICAL jobs applying to same region:  0.9 

Percent of CONSTRUCTION jobs applying to same region:  0.94 



#### Looks like finance jobs are less likely to apply to the same region, but not by a wide enough margin where I want to factor it into the model

### Feature Correlations

#### I want to see if there's variance between variables and location, so let's create a column for users applying to jobs far away

In [26]:
df['state_diff'] = df.apply(lambda x: 1 if x['JobState'] != x['UserState'] else 0, axis=1)
df['region_diff'] = df.apply(lambda x: 1 if x['job_region'] != x['user_region'] else 0, axis=1)
df['country_diff'] = df.apply(lambda x: 1 if x['job_country'] != x['user_country'] else 0, axis=1)

In [27]:
df.sample(2)

Unnamed: 0,USERID,JOBID,UserState,CAREERLEVEL,UserEd,JobState,JobEd,JobType,job_country,user_country,job_region,user_region,state_diff,region_diff,country_diff
1250,33ea7cb9-e01f-4b4f-b069-d59ce700a7eb,ce7ae102-a5f3-4177-a609-5ababc073203,Wisconsin,Entry Level,Unspecified,Wisconsin,High School or equivalent,CONSTRUCTION,USA,USA,M,M,0,0,0
1718,101d051f-56db-4256-b7f1-c97dbc80741a,97a3d05e-34ae-4dad-ab85-5c36ffaf2321,Texas,Entry Level,High School or equivalent,Texas,Certification,CONSTRUCTION,USA,USA,S,S,0,0,0


In [28]:
#Create dictionary to get totals
def compare_totals(df, var):
    var_total  = df.groupby([var]).agg({'USERID':'count',"state_diff":"sum","region_diff":"sum","country_diff":"sum"}).reset_index()      
    totals = pd.Series(var_total['USERID'].values,index=var_total[var]).to_dict() 
    var_total['total'] = var_total[var].map(totals) 
    var_total['state_diff'] = round(var_total['state_diff']/var_total['total'],3)
    var_total['region_diff'] = round(var_total['region_diff']/var_total['total'],3)
    var_total['country_diff'] = round(var_total['country_diff']/var_total['total'],3)
    return var_total


In [29]:
compare_totals(df, 'CAREERLEVEL')

Unnamed: 0,CAREERLEVEL,USERID,state_diff,region_diff,country_diff,total
0,Entry Level,789,0.279,0.112,0.009,789
1,"Executive (SVP, VP, Department Head, etc)",106,0.255,0.104,0.0,106
2,Experienced (Non-Manager),1288,0.229,0.122,0.015,1288
3,Manager (Manager/Supervisor of Staff),776,0.195,0.084,0.01,776
4,"Senior Executive (President, CEO, etc)",30,0.133,0.067,0.0,30
5,Student (High School),232,0.134,0.043,0.017,232
6,Student (undergraduate/graduate),381,0.199,0.102,0.008,381


#### Doesn't look like Career level is indicative of location differentials. Let's look at Job type

In [30]:
compare_totals(df, 'JobType')

Unnamed: 0,JobType,USERID,state_diff,region_diff,country_diff,total
0,CONSTRUCTION,2023,0.132,0.061,0.009,2023
1,FINANCE,1155,0.394,0.18,0.02,1155
2,MEDICAL,424,0.193,0.097,0.0,424


#### Compare job types with vars

In [31]:
#Create dictionary to get totals
def compare_jobtype_totals(df, var):
    df = pd.concat([df, pd.get_dummies(df['JobType'])], axis=1)
    var_total  = df.groupby([var]).agg({'USERID':'count','FINANCE':'sum','MEDICAL':'sum','CONSTRUCTION':'sum'}).reset_index()      
    totals = pd.Series(var_total['USERID'].values,index=var_total[var]).to_dict() 
    var_total['total'] = var_total[var].map(totals) 
    var_total['FINANCE'] = round(var_total['FINANCE']/var_total['total'],3)
    var_total['MEDICAL'] = round(var_total['MEDICAL']/var_total['total'],3)
    var_total['CONSTRUCTION'] = round(var_total['CONSTRUCTION']/var_total['total'],3)
    return var_total

compare_jobtype_totals(df, "UserEd")

Unnamed: 0,UserEd,USERID,FINANCE,MEDICAL,CONSTRUCTION,total
0,Associate Degree,71,0.099,0.268,0.634,71
1,Bachelors Degree,248,0.435,0.149,0.415,248
2,Certification,71,0.183,0.127,0.69,71
3,Doctorate,20,0.75,0.1,0.15,20
4,High School or equivalent,316,0.057,0.136,0.807,316
5,Masters Degree,226,0.841,0.062,0.097,226
6,Professional,15,0.333,0.0,0.667,15
7,Some College Coursework Completed,166,0.163,0.114,0.723,166
8,Some High School Coursework,53,0.113,0.057,0.83,53
9,Unspecified,2195,0.322,0.111,0.568,2195


In [32]:
compare_jobtype_totals(df, "JobEd")

Unnamed: 0,JobEd,USERID,FINANCE,MEDICAL,CONSTRUCTION,total
0,Associate Degree,118,0.085,0.847,0.068,118
1,Bachelors Degree,923,0.852,0.108,0.04,923
2,Certification,254,0.0,0.362,0.638,254
3,Doctorate,12,1.0,0.0,0.0,12
4,High School or equivalent,1666,0.016,0.034,0.95,1666
5,Masters Degree,161,0.82,0.18,0.0,161
6,Professional,220,0.755,0.091,0.155,220
7,Some College Coursework Completed,43,0.488,0.302,0.209,43
8,Some High School Coursework,157,0.0,0.025,0.975,157
9,Vocational,46,0.022,0.174,0.804,46


#### Analysis: while data jobs see more applicants from different regions, the difference between the locations of data applicants and the location of nursing and transport applicants wasn’t significant enough to warrant inclusion in a model.

#### Similarly, I there's a slight increase in the likelihood of those further along in their careers to apply to jobs outside their state, but again the difference isn’t significant enough to warrant adding to the model. 

### Feature Selection & Engineering

In [33]:
list(df.job_country.unique())

['USA',
 nan,
 'SP',
 'UK',
 'GER',
 'FR',
 'CR',
 'IR',
 'CAN',
 'AUS',
 'IT',
 'NETH',
 'SW']

#### Note: I'm removing user_state from the model. If I apply to a job in new york that someone in London also applied for, I don't want my recommendations to be affected by what other people in London are applying to.

In [34]:
#Feature Engineering
df_train = df[['CAREERLEVEL','job_region','JobState','job_region','job_country','JobEd', 'JobType']]
df_train=df_train.fillna(value=0)

# Weighting

### I spent considerable time thinking about how to give certain variables more weight than others, as not all variables should be considered to have equal value in recommendations. For example, recommending an entry-level nursing job in Arizona that requires a bachelor’s degree for someone in Arizona who has a bachelor’s degree, is looking for an entry level position, but works in data would not be valuable. 

### **In order to put more weight on certain variables, I duplicated their columns: this will add an extra dimension to each observation’s vector and decrease its distance from observations that share that attribute’s value. I weighted the variables as follows:

#### •	Job type: 7

#### •	Job education: 4

#### •	Job state: 3

#### •	Job region:2

#### •	Job country: 2

#### •	User career: 1

In [36]:
def create_weights(df, weights):
    cols = list(df.columns)
    for i, col in enumerate(cols):        
        for x in range(0, weights[col]):
            #CREATE DUMMY VARIABLE FOR EACH VALUE IN COLUMNS WE CARE ABOUT
            df = pd.concat([df, pd.get_dummies(df[col])], axis=1)
    #drop columns not in dummy format
    df = df.drop(cols,axis=1)
    df=df.fillna(value=0)
    return df

In [37]:
weights = {'CAREERLEVEL':1, 'job_region':2,'job_country':2,'JobState':3,'JobEd':4, 'JobType':7}
df_train = create_weights(df_train, weights)

print('Number of Vars: ',len(list(df_train.columns)))

Number of Vars:  436


### Here, I'm "inorganically" applying weights to the variables. It's more important that we recommend a job to a user that is in their field than is in their state (i.e., it's better to show a data position in new york to a data applicant living in california than to show them a nursing job in california.) 

### This is not the cleanest or most glamorous method, but it should have the intended effect. 

#### Job type is most important, and after job type, the next most important should be job education level. The next most important variable is location, but if the states don't match, I want to suggest jobs in the general region, so I'm adding job_region. Finally, if there are multiple possibilites with those options, then we can look to see if users with similar educations and career levels applied.

#### The actual value of the weights is not as important as the size of the weights relative to each other.

#### I used job region so that if a position is not in the same state as the job previously applied to, positions in the same region will be prioritized over positions farther away. By using both region and state, it will prioritize regional jobs over non-regional jobs, but still award the most weight to jobs in the same state.

### Train

In [38]:
knn_weight = NearestNeighbors(metric='cosine',algorithm='brute', n_neighbors=40)
knn_weight.fit(df_train)
distances, indices = knn_weight.kneighbors(df_train)

In [39]:
#Indicies output is in an array
print(indices[0])

[3204 3058 3059    0 1206 1622 1525  335 3428 1523    2 2017  815  329
  814 3328  328  336  333 2579 2859  334 1205  327  326 2858  331  330
 1522 1204  332 1203 3133 2526 2910 1519 1520 1521 3216 2860]


#### We run into a problem here: since different applicants apply to the same job, our model has multiple rows for the same job even though the vectors are different. With a nurse from Texas with a bachelors, Our model wants to first recommend a job that a different nurse from texas with a bachelors applied for, and second wants to recommend a job that a nurse from texas with an associates applied for, since the model sees them as different. However, they're the same job, so we have to delete those duplciates. We can see how the duplicates arise from the example below

In [43]:
df[df.JOBID=='356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb']

Unnamed: 0,USERID,JOBID,UserState,CAREERLEVEL,UserEd,JobState,JobEd,JobType,job_country,user_country,job_region,user_region,state_diff,region_diff,country_diff
850,7767d3a3-1e80-4225-8a04-1daf22744546,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Student (undergraduate/graduate),Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
851,7767d3a3-1e80-4225-8a04-1daf22744546,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Student (undergraduate/graduate),Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
852,03a4b1e3-6504-4033-9d1a-5a70f06e6958,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Experienced (Non-Manager),Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
853,794c58f0-1929-4d22-8074-3c7339fb1f0e,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Experienced (Non-Manager),Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
854,5068bd5b-4a2c-4270-9f18-1163f86c69a7,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,"Executive (SVP, VP, Department Head, etc)",Bachelors Degree,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
855,57e06aa2-d564-4532-adac-c32c430580ea,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Manager (Manager/Supervisor of Staff),Bachelors Degree,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
856,195c64cb-9093-4e71-b42e-de5047e48838,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Entry Level,Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
857,e2ed7533-f1e6-4a51-985d-84149a76e199,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Entry Level,Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
858,cfa385a8-ccf6-45d3-8ddd-ad9a5c6ab467,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Manager (Manager/Supervisor of Staff),Unspecified,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0
859,9ff781af-03e0-4110-9ff7-201c32a8197e,356aa9a3-cdb8-4d88-8fb6-714dd04ba9bb,New York,Manager (Manager/Supervisor of Staff),Associate Degree,New York,Associate Degree,MEDICAL,USA,USA,N,N,0,0,0


#### We have 40 neighbors (excessive, but ensuring that we have enough in case the first 30 are all the job above). Lets create a list for each that can be added as columns

In [44]:
matches = [x for x in range(1, 41)]

#Create empty list for neighbor quantity (1-40)
for x in range(1,len(matches)+1):
    locals()['match_{0}'.format(x)] = []

#append the jobID of each neighbor to the appropriate list
for i in range(0, len(df_train)):
    for x in range(1, len(matches) + 1):
        locals()['match_{0}'.format(x)].append(df['JOBID'].iloc[indices[i][x-1]])
    

#### Now, add the lists as columns

In [45]:
results = df.copy()
for x in range(1, len(matches)+1):
    results['match_{0}'.format(x)] = locals()['match_{0}'.format(x)]
print(results.columns)

Index(['USERID', 'JOBID', 'UserState', 'CAREERLEVEL', 'UserEd', 'JobState',
       'JobEd', 'JobType', 'job_country', 'user_country', 'job_region',
       'user_region', 'state_diff', 'region_diff', 'country_diff', 'match_1',
       'match_2', 'match_3', 'match_4', 'match_5', 'match_6', 'match_7',
       'match_8', 'match_9', 'match_10', 'match_11', 'match_12', 'match_13',
       'match_14', 'match_15', 'match_16', 'match_17', 'match_18', 'match_19',
       'match_20', 'match_21', 'match_22', 'match_23', 'match_24', 'match_25',
       'match_26', 'match_27', 'match_28', 'match_29', 'match_30', 'match_31',
       'match_32', 'match_33', 'match_34', 'match_35', 'match_36', 'match_37',
       'match_38', 'match_39', 'match_40'],
      dtype='object')


#### We can see below that numerous users are being recommended the same job as their 1st and 2nd match

In [46]:
results['duplicate']=results.apply(lambda x: True if x['match_1']==x['match_2'] else False, axis=1)
print('Number of users with identical first and second job matches: ', len(results[results.duplicate]))

Number of users with identical first and second job matches:  591


#### Lets use the function below to find the top 3 unique job numbers

In [47]:
def delete_duplicates(row):
    applied = row[0]
    uniques = list(set(row.values.tolist()))    
    if applied in uniques:
        uniques.remove(applied)
    row[1], row[2], row[3] = uniques[0], uniques[1], uniques[2]
    return row

In [48]:
results_test = results[['JOBID','match_1',
       'match_2', 'match_3', 'match_4', 'match_5', 'match_6', 'match_7',
       'match_8', 'match_9', 'match_10', 'match_11', 'match_12', 'match_13',
       'match_14', 'match_15', 'match_16', 'match_17', 'match_18', 'match_19',
       'match_20', 'match_21', 'match_22', 'match_23', 'match_24', 'match_25',
       'match_26', 'match_27', 'match_28', 'match_29', 'match_30', 'match_31',
       'match_32','match_33','match_34','match_35','match_36','match_37','match_38',
        'match_39','match_40',]]

results_test = results_test.apply(lambda x: delete_duplicates(x), axis=1)

#### Let's check to see if we got rid of the duplicates

In [49]:
results_test['duplicate']=results_test.apply(lambda x: True if x['match_1']==x['match_2'] else False, axis=1)
print('Number of users with identical first and second job matches: ',len(results_test[results_test.duplicate]))
results_test['duplicate']=results_test.apply(lambda x: True if x['match_2']==x['match_3'] else False, axis=1)
print('Number of users with identical second and third job matches: ', len(results_test[results_test.duplicate]))
results_test['duplicate']=results_test.apply(lambda x: True if x['match_1']==x['match_3'] else False, axis=1)
print('Number of users with identical first and third job matches: ', len(results_test[results_test.duplicate]))
results_test['duplicate']=results_test.apply(lambda x: True if (x['JOBID']==x['match_1']) or (x['JOBID']==x['match_2'])
                                             or (x['JOBID']==x['match_3']) else False, axis=1)
print('Number of users shown recommendation they previously applied to: ', len(results_test[results_test.duplicate]))

Number of users with identical first and second job matches:  0
Number of users with identical second and third job matches:  0
Number of users with identical first and third job matches:  0
Number of users shown recommendation they previously applied to:  0


#### Perfect! Lets map those columns back onto the main results page

In [50]:
results['match_1'],results['match_2'], results['match_3'] = results_test['match_1'].tolist(),results_test['match_2'].tolist(), results_test['match_3'].tolist() 

results = results[['USERID', 'JOBID', 'UserState', 'CAREERLEVEL', 'UserEd', 'JobState','JobEd', 'JobType', 'job_country', 
           'user_country', 'job_region','user_region', 'state_diff', 'region_diff', 'country_diff', 'match_1','match_2', 
           'match_3']]

### How did we do?

In [51]:
results_indices = df[['USERID', 'JOBID', 'CAREERLEVEL', 'JobState','job_region','JobEd', 'JobType']].set_index('JOBID')

sample = results[['USERID', 'JOBID', 'CAREERLEVEL', 'JobState','job_region','JobEd', 'JobType',
           'match_1','match_2', 'match_3']].sample(5)
for index, row in sample.iterrows():
    print('JOB: ','\n', row,'\n')
    print('Rec 1: ',results_indices.loc[sample['match_1'].loc[index]],'\n')
    print('Rec 2: ',results_indices.loc[sample['match_2'].loc[index]],'\n')
    print('Rec 3: ',results_indices.loc[sample['match_3'].loc[index]],'\n')
    print('\n\n\n')

JOB:  
 USERID          9c0f783c-b249-41a2-9e0a-e4cb46018be4
JOBID           a3e34c0f-d175-4a88-9c8f-a40ff0ce7912
CAREERLEVEL    Manager (Manager/Supervisor of Staff)
JobState                                     Georgia
job_region                                         S
JobEd                      High School or equivalent
JobType                                 CONSTRUCTION
match_1         85fe685a-1cbf-4566-bde6-8079399eae2e
match_2         5a84fbd7-be03-43e1-af95-e3c3e5fb15b7
match_3         f168c43f-6114-4824-adca-bde6ddba6a8f
Name: 121, dtype: object 

Rec 1:  USERID         b4ea148d-c0a2-40ba-bf91-a10f369a8aa7
CAREERLEVEL                   Student (High School)
JobState                                    Georgia
job_region                                        S
JobEd                     High School or equivalent
JobType                                CONSTRUCTION
Name: 85fe685a-1cbf-4566-bde6-8079399eae2e, dtype: object 

Rec 2:                                                 

## The recommendations look pretty good! Although messy, it looks like duplicating columns can function as a weighting algorithm for KNN!

#### In addition, a more sophisticated algorithm to recommend regional jobs would improve the model. While the model does well in recommending jobs in the same region for certain parts of the US (say, recommending a job in Pennsylvania to an applicant that lives in New Jersey), the same algorithm doesn’t apply as well for the Midwest and West Coast, where the algorithm will recommend a job in Texas to someone from Colorado (not an outrageous recommendation, but certainly not as ideal as NJ to PA). I also didn’t develop an efficient way to account for rows with certain variables blank: if a location is blank, it can skew recommendations since the algorithm considers it a match with other attributes that are blank.

#### Commentary: There are plenty of aspects that could be improved in this model. For one, empirical evidence regarding tradeoffs for applicants would be key: would a data analyst be more likely to apply to a job outside his or her state if the education requirement meets his/her own? 

In [None]:
Let me know what you think! email: austintarullo@gmail.com