# Recommender System Explanation

## Set up

### Import Libraries

In [1]:
import pickle
import numpy

import pandas
import dask.dataframe as dask_dataframe

from src.generate_ratings import generate_rating_set
from src.generate_train_test import generate_train_test_set

from sklearn.neighbors import NearestNeighbors

### Load Raw Dataset

In [2]:
user_ratings_df = pandas.read_table('dataset/apps.tsv', sep='\t')
users_df = pandas.read_table('dataset/users.tsv', sep='\t')
popular_jobs_df = pandas.read_csv('dataset/popular_jobs_full.csv')
jobs_df = pandas.read_table('dataset/jobs.tsv', sep='\t', on_bad_lines='skip')

print(f'User rating dimension: {user_ratings_df.shape}')
print(f'User detail dimension: {users_df.shape}')
print(f'Popular jobs dimension: {popular_jobs_df.shape}')
print(f'Job detail dimension: {jobs_df.shape}')


display(user_ratings_df.head(5))
display(users_df.head(5))
display(popular_jobs_df.head(5))
display(jobs_df.head(5))

User rating dimension: (1603111, 5)
User detail dimension: (389708, 15)
Popular jobs dimension: (389708, 2)
Job detail dimension: (1091923, 11)




Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748


Unnamed: 0,UserID,WindowID,Split,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany
0,47,1,Train,Paramount,CA,US,90723,High School,,1999-06-01 00:00:00,3,10.0,Yes,No,0
1,72,1,Train,La Mesa,CA,US,91941,Master's,Anthropology,2011-01-01 00:00:00,10,8.0,Yes,No,0
2,80,1,Train,Williamstown,NJ,US,8094,High School,Not Applicable,1985-06-01 00:00:00,5,11.0,Yes,Yes,5
3,98,1,Train,Astoria,NY,US,11105,Master's,Journalism,2007-05-01 00:00:00,3,3.0,Yes,No,0
4,123,1,Train,Baton Rouge,LA,US,70808,Bachelor's,Agricultural Business,2011-05-01 00:00:00,1,9.0,Yes,No,0


Unnamed: 0,UserID,JobID
0,47,982331 979937 821993 711528 688334 687409 6265...
1,72,1105054 1009975 960607 946809 943507 935727 90...
2,80,1073864 899228 87383 1116150 1116003 1115812 1...
3,98,1100130 1084489 1000230 935816 907760 877552 8...
4,123,1108926 1107596 1107473 1107254 1107158 110197...


Unnamed: 0,JobID,WindowID,Title,Description,Requirements,City,State,Country,Zip5,StartDate,EndDate
0,1,1,Security Engineer/Technical Lead,<p>Security Clearance Required:&nbsp; Top Secr...,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,Washington,DC,US,20531.0,2012-03-07 13:17:01.643,2012-04-06 23:59:59
1,4,1,SAP Business Analyst / WM,<strong>NO Corp. to Corp resumes&nbsp;are bein...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Charlotte,NC,US,28217.0,2012-03-21 02:03:44.137,2012-04-20 23:59:59
2,7,1,P/T HUMAN RESOURCES ASSISTANT,<b> <b> P/T HUMAN RESOURCES ASSISTANT</b> <...,Please refer to the Job Description to view th...,Winter Park,FL,US,32792.0,2012-03-02 16:36:55.447,2012-04-01 23:59:59
3,8,1,Route Delivery Drivers,CITY BEVERAGES Come to work for the best in th...,Please refer to the Job Description to view th...,Orlando,FL,US,,2012-03-03 09:01:10.077,2012-04-02 23:59:59
4,9,1,Housekeeping,I make sure every part of their day is magica...,Please refer to the Job Description to view th...,Orlando,FL,US,,2012-03-03 09:01:11.88,2012-04-02 23:59:59


## Data Preprocessing

### Step 1: Scope the dataset by location

Most users and job posts in the dataset come from the US, so we focus on US data and keep only the states with at least MIN_STATE_THRESHOLD users.

In [11]:
MIN_STATE_THRESHOLD = 1000
print(f'User details: Total user globally = {users_df.shape[0]}')
us_users_df = users_df[users_df['Country']=='US']
print(f'User details: US focus user = {us_users_df.shape[0]}\n')

print(f'User rating: Full ratings count: {user_ratings_df.shape[0]}')

unique_states_df = us_users_df['State'].value_counts().to_frame()
state_to_keep = unique_states_df[unique_states_df['State'] >= MIN_STATE_THRESHOLD].index.tolist()

print('----------------------------------------------')
print(f'State: All US state count: {unique_states_df.shape[0]}')
print(f'State: US state with minimum {MIN_STATE_THRESHOLD} userbase number: {len(state_to_keep)}')
display(unique_states_df.head(10))
print('----------------------------------------------')

print(f'User details: Before states filtering = {us_users_df.shape[0]}')
us_users_df = us_users_df[us_users_df['State'].isin(state_to_keep)]
print(f'User details: After states filtering = {us_users_df.shape[0]}\n')

print(f'User ratings: Before states filtering = {user_ratings_df.shape[0]}')
temp_user_ratings_df = user_ratings_df[user_ratings_df['UserID'].isin(us_users_df['UserID'].values.tolist())]
print(f'User ratings: After states filtering = {temp_user_ratings_df.shape[0]}\n')

display(us_users_df.head(10))
display(temp_user_ratings_df.head(10))

User details: Total user globally = 389708
User details: US focus user = 388499

User rating: Full ratings count: 1603111
----------------------------------------------
State: All US state count: 56
State: US state with minimum 1000 userbase number: 37


Unnamed: 0,State
FL,43100
TX,35864
CA,33019
IL,24322
NY,20578
PA,17103
GA,16156
NJ,15706
OH,15498
NC,14755


----------------------------------------------
User details: Before states filtering = 388499
User details: After states filtering = 382829

User ratings: Before states filtering = 1603111
User ratings: After states filtering = 1579408


Unnamed: 0,UserID,WindowID,Split,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany
0,47,1,Train,Paramount,CA,US,90723,High School,,1999-06-01 00:00:00,3,10.0,Yes,No,0
1,72,1,Train,La Mesa,CA,US,91941,Master's,Anthropology,2011-01-01 00:00:00,10,8.0,Yes,No,0
2,80,1,Train,Williamstown,NJ,US,8094,High School,Not Applicable,1985-06-01 00:00:00,5,11.0,Yes,Yes,5
3,98,1,Train,Astoria,NY,US,11105,Master's,Journalism,2007-05-01 00:00:00,3,3.0,Yes,No,0
4,123,1,Train,Baton Rouge,LA,US,70808,Bachelor's,Agricultural Business,2011-05-01 00:00:00,1,9.0,Yes,No,0
5,131,1,Train,Houston,TX,US,77077,Bachelor's,Finance,1998-05-01 00:00:00,3,14.0,,No,0
6,162,1,Train,Long Beach,CA,US,90807,Master's,I/O Psychology,2012-05-01 00:00:00,10,25.0,No,No,0
7,178,1,Train,Greenville,SC,US,29609,High School,Not Applicable,,6,35.0,No,Yes,4
9,344,1,Train,Newport News,VA,US,23601,High School,Not Applicable,2007-01-01 00:00:00,3,7.0,Yes,No,0
10,395,1,Train,Wildwood,MO,US,63038,Bachelor's,Marketing,,10,24.0,Yes,No,0


Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748
5,47,1,Train,2012-04-05 02:34:40.223,576958
6,47,1,Train,2012-04-05 22:55:03.583,262470
7,47,1,Train,2012-04-05 02:38:49.52,602298
8,72,1,Train,2012-04-02 22:36:43.033,834662
9,72,1,Train,2012-04-07 15:19:58.187,1020903
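
The state filter in this step reduces to counting users per state and keeping the states whose count meets the threshold. A minimal plain-Python sketch of that logic on made-up records (the `keep_frequent` helper and the toy data are illustrative assumptions, not part of the notebook):

```python
from collections import Counter

def keep_frequent(values, min_count):
    """Return the set of values that occur at least min_count times."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n >= min_count}

# Toy user records: (user_id, state). Hypothetical data for illustration only.
toy_users = [(1, 'FL'), (2, 'FL'), (3, 'TX'), (4, 'FL'), (5, 'TX'), (6, 'GU')]

# Keep states with at least 2 users, then keep only users from those states
states_to_keep = keep_frequent((state for _, state in toy_users), min_count=2)
filtered_users = [(uid, state) for uid, state in toy_users if state in states_to_keep]

print(sorted(states_to_keep))
print(filtered_users)
```

The same count-then-filter pattern is reused for the application and job-occurrence thresholds in the following steps.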


### Step 2: User Application Filtering

In [12]:
MIN_APPLICATION = 15 # TODO: Set minimum user application threshold

print('----------------------------------------------')
unique_user_table = temp_user_ratings_df['UserID'].value_counts().to_frame()
print(f'User: Full user count: {unique_user_table.shape[0]}')
unique_user_table = unique_user_table[unique_user_table['UserID']>=MIN_APPLICATION]
print(f'User: Filtered user count: {unique_user_table.shape[0]} for user with over {MIN_APPLICATION} application threshold.')
display(unique_user_table.head(10))
print('----------------------------------------------')

user_to_keep = unique_user_table.index.tolist()
print(f'User rating: Before application filter: {temp_user_ratings_df.shape[0]}')
temp_user_ratings_df = temp_user_ratings_df[temp_user_ratings_df['UserID'].isin(user_to_keep)]
print(f'User rating: After application filter: {temp_user_ratings_df.shape[0]}')
display(temp_user_ratings_df.head(10))

----------------------------------------------
User: Full user count: 315678
User: Filtered user count: 20128 for user with over 15 application threshold.


Unnamed: 0,UserID
296500,2473
1127206,1157
1382297,992
991504,499
802983,436
985229,426
33263,376
621858,345
948207,342
466070,330


----------------------------------------------
User rating: Before application filter: 1579408
User rating: After application filter: 629204


Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
51,554,1,Train,2012-04-02 11:05:25.333,196603
52,554,1,Train,2012-04-02 05:08:41.413,300053
53,554,1,Train,2012-04-02 11:05:24.333,1078274
54,554,1,Train,2012-04-02 11:05:26.087,146817
55,554,1,Train,2012-04-02 11:05:25.833,654538
56,554,1,Train,2012-04-02 05:08:41.303,336293
57,554,1,Train,2012-04-02 11:05:25.76,640492
58,554,1,Train,2012-04-02 11:18:18.257,271546
59,554,1,Train,2012-04-02 05:08:44.72,283949
60,554,1,Train,2012-04-02 05:11:22.263,1066757


### Step 3: Generate Negative Job Sample

In [14]:
NEGATIVE_SAMPLE_NUM = 8  # TODO: Set the minimum negative sampling
NEGATIVE_SAMPLE_RATIO = 0.8 # TODO: Set the ratio of negative sampling to positive ones

neg_temp_ratings_df = generate_rating_set(user_apps_df=temp_user_ratings_df, popular_jobs_df=popular_jobs_df, negative_num=NEGATIVE_SAMPLE_NUM, negative_ratio=NEGATIVE_SAMPLE_RATIO, negative_value=-1)
neg_temp_ratings_df.to_csv('dataset/working/user_ratings_neg_1000_15_10_08.csv')
print(f'User rating: Positive ratings count: {temp_user_ratings_df.shape[0]}')
print(f'User rating: Positive with negative ratings count: {neg_temp_ratings_df.shape[0]}')
display(neg_temp_ratings_df.head(10))

Total unique user: (20128,)
Progress at index: 0 size: (92, 3)
Progress at index: 5000 size: (290642, 3)
Progress at index: 10000 size: (557470, 3)
Progress at index: 15000 size: (825452, 3)
Progress at index: 20000 size: (1084172, 3)
(1090629, 3)
User rating: Positive ratings count: 629204
User rating: Positive with negative ratings count: 1090629


Unnamed: 0,UserID,JobID,Rating
0,554,196603,1
1,554,300053,1
2,554,1078274,1
3,554,146817,1
4,554,654538,1
5,554,336293,1
6,554,640492,1
7,554,271546,1
8,554,283949,1
9,554,1066757,1
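
The internals of `generate_rating_set` are not shown in this notebook. The sketch below is one plausible scheme, not the author's implementation: it assumes the number of negatives per user is the larger of a minimum count and a ratio of the positives, drawn from popular jobs the user has not applied to. The helper name, the sampling rule, and the toy data are all assumptions.

```python
import random

def sample_negatives(positive_jobs, candidate_jobs, min_num, ratio,
                     negative_value=-1, seed=0):
    """Return (job_id, rating) pairs: positives as 1 plus sampled negatives.

    Assumed rule (hypothetical): draw max(min_num, ratio * #positives)
    negatives from candidates the user has not applied to, capped by pool size.
    """
    rng = random.Random(seed)
    positives = set(positive_jobs)
    # Never label a job the user applied to as negative
    pool = sorted(set(candidate_jobs) - positives)
    target = max(min_num, int(ratio * len(positives)))
    negatives = rng.sample(pool, min(target, len(pool)))
    return ([(job, 1) for job in positive_jobs]
            + [(job, negative_value) for job in negatives])

# Toy data; the candidate string mimics the space-separated JobID column
# of popular_jobs_df
applied = [101, 102, 103, 104, 105]
popular = [int(j) for j in '101 201 202 203 204 205 206'.split()]

ratings = sample_negatives(applied, popular, min_num=2, ratio=0.8)
print(ratings)
```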


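#### Get Full
Rating sets generated so far. The values appear to follow the order used in the file names (MIN_STATE_THRESHOLD, MIN_APPLICATION, MIN_OCCURRENCE, NEGATIVE_SAMPLE_RATIO):
- 1000, 10, 10, 1
- 1000, 15, 10, 0.8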

In [21]:
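# index_col=0 restores the index written by to_csv and avoids an extra 'Unnamed: 0' column
neg_temp_ratings_df = pandas.read_csv('dataset/working/user_ratings_neg_1000_15_10_08.csv', index_col=0)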

### Step 4: Job Occurrence Filtering

In [15]:
MIN_OCCURRENCE = 10 # TODO: Set minimum job occurrence threshold

print(f'User rating: Positive with negative ratings count: {neg_temp_ratings_df.shape[0]}')

print('----------------------------------------------')
unique_job_table = neg_temp_ratings_df['JobID'].value_counts().to_frame()
print(f'Job: Full job count: {unique_job_table.shape[0]}')
unique_job_table = unique_job_table[unique_job_table['JobID']>=MIN_OCCURRENCE]
print(f'Job: Filtered job count: {unique_job_table.shape[0]} for job with over {MIN_OCCURRENCE} application threshold.')
display(unique_job_table.head(10))
print('----------------------------------------------')

job_to_keep = unique_job_table.index.tolist()
print(f'User rating: Before job filter: {neg_temp_ratings_df.shape[0]}')
temp_neg_temp_user_ratings_df = neg_temp_ratings_df[neg_temp_ratings_df['JobID'].isin(job_to_keep)]
print(f'User rating: After job filter: {temp_neg_temp_user_ratings_df.shape[0]}')
temp_neg_temp_user_ratings_df.head(10)

User rating: Positive with negative ratings count: 1090629
----------------------------------------------
Job: Full job count: 297733
Job: Filtered job count: 22498 for job with over 10 application threshold.


Unnamed: 0,JobID
1116220,222
1116172,220
1115878,206
1115760,203
1115910,197
1115986,193
1116306,193
1115707,189
1115645,187
1115443,186


----------------------------------------------
User rating: Before job filter: 1090629
User rating: After job filter: 513656


Unnamed: 0,UserID,JobID,Rating
0,554,196603,1
1,554,300053,1
2,554,1078274,1
3,554,146817,1
4,554,654538,1
6,554,640492,1
8,554,283949,1
11,554,1042648,1
13,554,600058,1
14,554,25820,1


### Step 5: Train Test Split

In [16]:
TRAIN_RATIO = 0.9

train_ratings_df, test_ratings_df = generate_train_test_set(user_ratings_full=temp_neg_temp_user_ratings_df, split_ratio=TRAIN_RATIO, split_seed=1234, min_split_size=5)

train_ratings_df.to_csv('dataset/working/user_ratings_neg_1000_15_10_08_train.csv')
test_ratings_df.to_csv('dataset/working/user_ratings_neg_1000_15_10_08_test.csv')

print(f'User rating: Train count: {train_ratings_df.shape[0]}')
print(f'User rating: Test count: {test_ratings_df.shape[0]}\n')

print(f'User rating: Unique user: {train_ratings_df["UserID"].drop_duplicates().shape[0]}')
print(f'User rating: Unique job: {train_ratings_df["JobID"].drop_duplicates().shape[0]}')
display(train_ratings_df.head(10))

Total unique users: 19123
Total user ratings: 513656
Progress at index: 0 or 0.0000 percent of total
Progress at index: 5000 or 0.2615 percent of total
Progress at index: 10000 or 0.5229 percent of total
Progress at index: 15000 or 0.7844 percent of total
User count: 17184
Train size of 0.9 total ratings count: 455157
Test size total ratings count: 58499
User rating: Train count: 455157
User rating: Test count: 58499

User rating: Unique user: 19123
User rating: Unique job: 22497


Unnamed: 0,UserID,JobID,Rating
0,554,946506,1
1,554,773410,1
2,554,35071,1
3,554,300020,1
4,554,491965,1
5,554,196603,1
6,554,802921,1
7,554,1042648,1
8,554,146817,1
9,554,497069,1
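
`generate_train_test_set` is imported from `src` and its body is not shown here. The following is a hedged sketch of a per-user split with the same parameters, under the assumption that users with fewer than `min_split_size` ratings stay entirely in the training set. The function name and that rule are hypothetical.

```python
import random

def split_per_user(ratings, split_ratio, split_seed, min_split_size):
    """Split (user, job, rating) triples per user.

    Assumption (hypothetical): users with fewer than min_split_size ratings
    stay entirely in the training set, so no user is test-only.
    """
    rng = random.Random(split_seed)
    by_user = {}
    for row in ratings:
        by_user.setdefault(row[0], []).append(row)

    train, test = [], []
    for user, rows in by_user.items():
        if len(rows) < min_split_size:
            train.extend(rows)
            continue
        rows = rows[:]                      # copy before shuffling
        rng.shuffle(rows)
        cut = int(len(rows) * split_ratio)  # first `cut` rows go to train
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# Toy data: users 1 and 2 have 10 ratings each, user 3 has only 2
toy = [(u, j, 1) for u in (1, 2) for j in range(10)] + [(3, 0, 1), (3, 1, -1)]
train, test = split_per_user(toy, split_ratio=0.9, split_seed=1234, min_split_size=5)
print(len(train), len(test))
```

Splitting within each user keeps every test user present in the training data, which a user-based KNN model needs in order to find neighbors for them.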


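#### Get Train Test

The rest of the notebook loads a previously saved split generated with a different parameter set (1000_20_20_1), so the user and job counts below differ from those in Step 5.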

In [4]:
train_ratings_df = pandas.read_csv('dataset/working/user_ratings_neg_1000_20_20_1_train.csv', index_col=0)
test_ratings_df = pandas.read_csv('dataset/working/user_ratings_neg_1000_20_20_1_test.csv', index_col=0)
print(f'User rating: Unique user: {train_ratings_df["UserID"].drop_duplicates().shape[0]}')
print(f'User rating: Unique job: {train_ratings_df["JobID"].drop_duplicates().shape[0]}')
display(train_ratings_df.head(10))

User rating: Unique user: 10943
User rating: Unique job: 7712


Unnamed: 0,UserID,JobID,Rating
0,554,640492,1
1,554,280275,1
2,554,747584,1
3,554,25820,1
4,554,600058,1
5,554,196603,1
6,554,1113088,1
7,554,40,1
8,554,627377,1
9,554,957,1


## Collaborative Filtering KNN

### Step 1: User Rating Pivot

In [5]:
user_map_data = train_ratings_df['UserID'].drop_duplicates().reset_index(drop=True)
user_mapping = user_map_data.to_dict()
print(f'Total user ID: {len(user_mapping)}')

job_map_data = train_ratings_df['JobID'].drop_duplicates().reset_index(drop=True)
job_mapping = job_map_data.to_dict()
print(f'Total job ID: {len(job_mapping)}')
    
user_mapping_inv =  {v: k for k, v in user_mapping.items()}
job_mapping_inv =  {v: k for k, v in job_mapping.items()}

train_ratings_df['UserID'] = train_ratings_df['UserID'].apply(lambda x: user_mapping_inv[x]).astype('uint16')
train_ratings_df['JobID'] = train_ratings_df['JobID'].apply(lambda x: job_mapping_inv[x]).astype('uint16')
train_ratings_df['Rating'] = train_ratings_df['Rating'].astype('int8')

print('User rating dtypes:')
display(train_ratings_df.dtypes)

Total user ID: 10943
Total job ID: 7712
User rating dtypes:


UserID    uint16
JobID     uint16
Rating      int8
dtype: object
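
Raw job IDs in this dataset exceed 65535 (for example 1116220), so they cannot be stored as `uint16` directly. Mapping them to contiguous indexes first makes the cast safe, since 10943 users and 7712 jobs both fit in 16 bits. A plain-Python sketch of the mapping and its inverse on toy IDs (the helper name is an illustrative assumption):

```python
def build_index_mapping(raw_ids):
    """Map each distinct raw ID to a contiguous index in first-seen order."""
    index_to_id = list(dict.fromkeys(raw_ids))   # de-duplicate, keep order
    id_to_index = {raw: idx for idx, raw in enumerate(index_to_id)}
    return index_to_id, id_to_index

UINT16_MAX = 2**16 - 1   # largest value a uint16 can hold (65535)

raw_user_ids = [554, 554, 613, 47, 613]   # toy IDs
index_to_user, user_to_index = build_index_mapping(raw_user_ids)

# The cast to uint16 is only safe while the largest index fits in 16 bits
assert len(index_to_user) - 1 <= UINT16_MAX

encoded = [user_to_index[u] for u in raw_user_ids]
decoded = [index_to_user[i] for i in encoded]
print(encoded)
print(decoded == raw_user_ids)
```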

In [6]:
final_train_ratings_df = train_ratings_df.drop_duplicates(['UserID', 'JobID'])
final_train_ratings_dsk = dask_dataframe.from_pandas(final_train_ratings_df, 10)

final_train_ratings_dsk = final_train_ratings_dsk.astype({'UserID':'uint16', 'JobID': 'category', 'Rating': 'int8'})
final_train_ratings_dsk['JobID'] = final_train_ratings_dsk['JobID'].cat.as_known()

print('Dask dtypes:')
display(final_train_ratings_dsk.dtypes)

user_ratings_pivot = dask_dataframe.DataFrame.pivot_table(final_train_ratings_dsk, index='UserID', columns='JobID', values='Rating').fillna(0).astype('int8')

# len() triggers the computation; .shape[0] alone is a lazy Delayed object in Dask
print(f'Pivot table with total user rows: {len(user_ratings_pivot)}')
print(f'Pivot table with total job columns: {user_ratings_pivot.shape[1]}')

display(user_ratings_pivot.head(10))

Dask dtypes:


UserID      uint16
JobID     category
Rating        int8
dtype: object

Pivot table with total user rows: 10943
Pivot table with total job columns: 7712


JobID,0,1,2,3,4,5,6,7,8,9,...,7702,7703,7704,7705,7706,7707,7708,7709,7710,7711
UserID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
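
The pivot above is a dense user-by-job matrix in which 0 marks "no interaction". A plain-Python sketch of the same construction from (user, job, rating) triples, on toy data, followed by a rough size estimate from the reported dimensions at 1 byte per `int8` cell (the helper name is an assumption):

```python
def build_user_item_rows(triples, n_jobs):
    """Turn (user_index, job_index, rating) triples into dense rows; 0 = no interaction."""
    rows = {}
    for user, job, rating in triples:
        row = rows.setdefault(user, [0] * n_jobs)
        row[job] = rating
    return rows

toy_triples = [(0, 0, 1), (0, 2, -1), (1, 1, 1)]
matrix = build_user_item_rows(toy_triples, n_jobs=3)
for user in sorted(matrix):
    print(user, matrix[user])

# Dense size of the pivot reported above: users x jobs cells, 1 byte each
n_users, n_jobs = 10943, 7712
dense_megabytes = n_users * n_jobs / 1e6
print(f'{dense_megabytes:.1f} MB')
```

Most cells are 0, so a sparse representation would be much smaller, but the dense `int8` form still fits comfortably in memory at this scale.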


In [8]:
with open('dataset/working/pivot_1000_20_20_1.pkl', 'wb') as file:
    pickle.dump(user_ratings_pivot.compute(), file)
print('Save completed')

Save completed


#### Get Pivot

In [None]:
with open('dataset/working/pivot_1000_20_20_1.pkl', 'rb') as file:
    user_ratings_pivot = pickle.load(file)
print('Load completed')

### Step 2: User KNN Model

In [9]:
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=50, n_jobs=-1)
model_knn.fit(user_ratings_pivot)
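
With `metric='cosine'` and `algorithm='brute'`, scikit-learn compares the query row against every fitted row using cosine distance, which is 1 minus cosine similarity. A plain-Python sketch of that computation on toy rating vectors (function names are illustrative, and non-zero vectors are assumed):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; about 0 for same direction, 1 for orthogonal vectors.

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_neighbors(query, rows, k):
    """Return (distance, row_index) pairs for the k rows closest to query."""
    scored = sorted((cosine_distance(query, row), idx) for idx, row in enumerate(rows))
    return scored[:k]

toy_rows = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, -1, 0, 0],
]
neighbors = brute_force_neighbors(toy_rows[0], toy_rows, k=3)
for dist, idx in neighbors:
    print(idx, round(dist, 4))
```

Because the query row is itself part of the fitted data, it comes back as its own nearest neighbor with a distance of approximately 0.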

### Step 3: Make Recommendation

In [11]:
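INSTANCE_INDEXES = [1200,1250]
# .loc slicing is label-based and inclusive, so this selects users 1200 through 1250
inst_k = user_ratings_pivot.loc[INSTANCE_INDEXES[0]:INSTANCE_INDEXES[1]]
# kneighbors returns cosine distances (smaller = more similar) and the row positions of the neighbors
inst_k_scores, inst_k_neighbors = model_knn.kneighbors(inst_k)
print(f'Generate recommendation instances from {INSTANCE_INDEXES}')
display(inst_k_scores[:3])     # distances for the first 3 query users
display(inst_k_neighbors[:3])  # neighbor positions for the first 3 query users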

Generate recommendation instances from [1200, 1250]


array([[1.11022302e-16, 2.31623930e-01, 2.44988011e-01, 2.69749270e-01,
        2.88487526e-01, 2.88487526e-01, 2.89953053e-01, 2.92893219e-01,
        2.98439240e-01, 2.99650708e-01, 3.05634925e-01, 3.06624755e-01,
        3.28248558e-01, 3.33027031e-01, 3.41293517e-01, 3.53606873e-01,
        3.54502776e-01, 3.65018725e-01, 3.65254550e-01, 3.67544468e-01,
        3.72504980e-01, 3.72623566e-01, 3.76294444e-01, 3.95833333e-01,
        3.96749397e-01, 4.06919747e-01, 4.08039979e-01, 4.11947348e-01,
        4.20249096e-01, 4.26585361e-01, 4.26585361e-01, 4.33861483e-01,
        4.35923925e-01, 4.40207132e-01, 4.40348120e-01, 4.46601409e-01,
        4.46705992e-01, 4.64174119e-01, 4.81732511e-01, 4.87794327e-01,
        4.89337820e-01, 4.89689637e-01, 4.91767001e-01, 4.95502158e-01,
        4.99306037e-01, 5.10021056e-01, 5.13335737e-01, 5.16698326e-01,
        5.18874776e-01, 5.31808908e-01],
       [1.11022302e-16, 6.27895796e-01, 6.61938298e-01, 6.83772234e-01,
        6.83772234e-01,

array([[1200, 1446,  409,  438, 2372, 1386, 1982, 1997, 1456,  688,  215,
        1110, 2542, 1322, 1720, 2334,  344, 2633, 2435, 1924,  719, 1709,
        2132,  171,  592, 1919, 1750,  973, 1353, 2135, 1902,  904, 1799,
        1872, 1946,  293,  280,  195,  330, 1927,  777,   76, 1426, 1663,
        1196,  163,  903,  453,  118, 2140],
       [1201, 1728, 1144, 2297, 2227,  562, 2620,  120, 2657, 1672,   20,
        1579,  700, 1221, 1314, 2247, 1387,  628, 1598, 1050, 1412, 1693,
        2639, 1667, 1899, 2068, 1804, 1051, 1432,  861,  243, 1088, 2459,
        2538, 1117, 2259, 2662, 1559, 2282,   91, 1627, 2129,  263,  111,
        1049, 1591, 2191,  383, 1520, 1891],
       [1202, 2378, 1397,  795,  182,  990, 2067, 1334,  187,  614,  952,
        1951,  284, 1047,  167,  983, 1761, 1396, 1285,  960,  865, 1054,
        1626,  110, 2179, 2654, 1472, 1942,  902,  530,  434, 1284,  895,
         116, 2059, 1763,  373,  375,  191, 1413, 1599, 2157,  980, 1143,
        1420, 1923,  2

### Step 4: Find Neighbors

In [12]:
inst_k_neighbors_list = [] # List of DataFrames, each holding the user details of one query user's neighbors
inst_k_neighbors_rating_list = [] # List of DataFrames, each holding the training ratings of those neighbors
for each in inst_k_neighbors:
    temp_neighbors_list = [user_mapping[x] for x in each]
    
    temp_neighbors_df = users_df[users_df['UserID'].isin(temp_neighbors_list)]
    inst_k_neighbors_list.append(temp_neighbors_df)

    temp_neighbors_rating_df = final_train_ratings_df[final_train_ratings_df['UserID'].isin(each)]
    inst_k_neighbors_rating_list.append(temp_neighbors_rating_df)
    
print('Generated neighbors set completed')

Generated neighbors set completed
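
One caveat for the following sections: filtering with `isin` keeps rows in the order of `users_df`, not in the order returned by `kneighbors`, so the query user is not guaranteed to be the first row of each neighbor frame. A plain-Python sketch of the difference on toy data (all names here are illustrative):

```python
# Toy stand-ins: a user table in storage order and one neighbor list from KNN
user_table = [
    {'UserID': 47, 'State': 'CA'},
    {'UserID': 72, 'State': 'CA'},
    {'UserID': 80, 'State': 'NJ'},
    {'UserID': 98, 'State': 'NY'},
]
neighbor_ids = [80, 47, 98]   # query user first, then its nearest neighbors

# Membership filter: mirrors df[df['UserID'].isin(ids)] and keeps table order
wanted = set(neighbor_ids)
by_membership = [row for row in user_table if row['UserID'] in wanted]

# Keyed lookup: follows the neighbor order, so element 0 is the query user
by_id = {row['UserID']: row for row in user_table}
by_neighbor_order = [by_id[uid] for uid in neighbor_ids if uid in by_id]

print([row['UserID'] for row in by_membership])
print([row['UserID'] for row in by_neighbor_order])
```

In pandas, one way to get the second behaviour, assuming `UserID` is unique, would be `users_df.set_index('UserID').reindex(temp_neighbors_list)`.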


## Generate Explanation

### Section 1: User profile scenario

In [48]:
def get_user_profile_template(user_detail_row):
    # Only mention a head count when the user has managed others
    if user_detail_row['ManagedOthers'] == 'Yes':
        managed_text = 'have managed {:} others'.format(user_detail_row['ManagedHowMany'])
    else:
        managed_text = 'have never managed others'
    user_text = 'You are user ID {:} who lives in {:}, {:} state. You have a {:} specialized in {:} with {:} years of experience. You have had {:} jobs in the past, and {:} before.\n'.format(user_detail_row['UserID'], user_detail_row['City'], user_detail_row['State'], user_detail_row['DegreeType'], user_detail_row['Major'], user_detail_row['TotalYearsExperience'], user_detail_row['WorkHistoryCount'], managed_text)
    return user_text

for k_neighbors in inst_k_neighbors_list:
    inst_row = k_neighbors.iloc[0]
    print(get_user_profile_template(inst_row))

You are user ID 61035 who lives in Holbrook, AZ state. You have a High School specialized in General Studies with 9.0 years of experience. You have had 7 jobs in the past, and have never managed others before.

You are user ID 16621 who lives in Dallas, TX state. You have a Bachelor's specialized in Management with 4.0 years of experience. You have had 2 jobs in the past, and have never managed others before.

You are user ID 92802 who lives in Atlanta, GA state. You have a None specialized in nan with 7.0 years of experience. You have had 4 jobs in the past, and have never managed others before.

You are user ID 22358 who lives in Riverdale, GA state. You have a High School specialized in Not Applicable with 6.0 years of experience. You have had 2 jobs in the past, and have managed 10 others before.

You are user ID 88381 who lives in Calumet City, IL state. You have a High School specialized in Not Applicable with 5.0 years of experience. You have had 3 jobs in the past, and have neve

### Section 2: User characteristic-based explanation
Features included in the explanations:
- Degree type (DegreeType): N people share your level of education
- Years of experience (TotalYearsExperience): the neighbors' average years of experience compared with yours
- Management experience (ManagedOthers): you and N others have (never) managed a team before

For the categorical features, count how many neighbors share the user's value, e.g. N people share your degree level (Bachelor's).

In [51]:
# Assumes the first row of each neighbors_df is the query user; the remaining rows are its neighbors
def get_explanation_user(neighbors_df_list):
    explanation_user_list = []
    for neighbors_df in neighbors_df_list:
        center = neighbors_df.iloc[0]
        circles = neighbors_df.iloc[1:]
        
        # Level of education
        if circles[circles['DegreeType'] == center['DegreeType']].shape[0] > 0:
            n_share_degree = circles['DegreeType'].value_counts()[center['DegreeType']]
        else:
            n_share_degree = 'None'
        
        # Average years of experience
        not_nan_experience = [val for val in circles['TotalYearsExperience'].values if not numpy.isnan(val)]
        avg_year_experience = numpy.mean(not_nan_experience)
        
        # Management experience
        is_manager = ' never' if center['ManagedOthers'] == 'No' else ''
        n_same_management = len(circles[circles['ManagedOthers']==center['ManagedOthers']])
        
        # Templating
        explanation_user_text = 'We suggested these jobs because you share similarities with people who also applied to these jobs. Out of {:}, {:} people share your level of education ({:}). On average they have {:.2f} years of experience (vs. your {:} years). You and {:} others have{:} managed a team before.\n'.format(circles.shape[0], n_share_degree, center['DegreeType'], avg_year_experience, center['TotalYearsExperience'], n_same_management, is_manager)
        
        explanation_user_list.append({'explanation': explanation_user_text, 'data': [n_share_degree, center['DegreeType'], avg_year_experience, n_same_management, is_manager], 'user_profile': get_user_profile_template(center)})
    
    return explanation_user_list
                         
                         
expl_list = get_explanation_user(inst_k_neighbors_list) 
for df, expl in zip(inst_k_neighbors_list, expl_list):
    # display(df.iloc[0])
    print(expl['explanation'])

We suggested these jobs because you share similarities with people who also applied to these jobs. Out of 49, 15 people share your level of education (High School). On average they have 10.87 years of experience (vs. your 9.0 years). You and 34 others have never managed a team before.

We suggested these jobs because you share similarities with people who also applied to these jobs. Out of 49, 7 people share your level of education (Bachelor's). On average they have 9.76 years of experience (vs. your 4.0 years). You and 38 others have never managed a team before.

We suggested these jobs because you share similarities with people who also applied to these jobs. Out of 49, 13 people share your level of education (None). On average they have 9.89 years of experience (vs. your 7.0 years). You and 35 others have never managed a team before.

We suggested these jobs because you share similarities with people who a

### Section 3: Application history-based explanation

In [46]:
def get_explanation_history(neighbors_df_list, neighbor_ratings_df_list, top_n=5):
    explanation_user_list = []
    for neighbors_df, rating_df in zip(neighbors_df_list, neighbor_ratings_df_list):
        center = neighbors_df.iloc[0]
        circles = neighbors_df.iloc[1:]
        
        
        temp_history = pandas.pivot_table(rating_df, index='JobID', values='Rating', aggfunc='sum').sort_values(['Rating'], ascending=False)[:top_n]
        share_applied_jobs_list = [job_mapping[idx] for idx in temp_history.index.tolist()]
        share_people_num_list = temp_history['Rating'].values.tolist()
        
        jobs_detail_df = jobs_df[jobs_df['JobID'].isin(share_applied_jobs_list)]

        
        jobs_list = ['Job ID ' + str(data['JobID']) + ' : ' + str(data['Title']) + ' in ' + str(data['City']) + '-' + str(data['State'] + ' like ' + str(num) + ' others') for (idx, data), num in zip(jobs_detail_df.iterrows(), share_people_num_list)]
        jobs_text = ',\n'.join(jobs_list)
        
        explanation_history_text = 'We suggested these jobs because people with an application history similar to yours also applied to them. According to your history, you have applied to\n{:} in the past.\n'.format(jobs_text)
        
        explanation_user_list.append({'explanation': explanation_history_text, 'data': jobs_list, 'user_profile': get_user_profile_template(center)})
        
    return explanation_user_list

expl_list = get_explanation_history(inst_k_neighbors_list, inst_k_neighbors_rating_list, 3)
for df, expl in zip(inst_k_neighbors_list, expl_list):
    # display(df.iloc[0])
    print(expl['explanation'])

We suggested these jobs because people with an application history similar to yours also applied to them. According to your history, you have applied to
Job ID 741469 : Customer Service Experience Wanted! in Scottsdale-AZ like 21 others,
Job ID 790763 : Manufacturing Manager in Lacombe-LA like 19 others,
Job ID 1038468 : Customer Service/Support in Phoenix-AZ like 17 others in the past.

We suggested these jobs because people with an application history similar to yours also applied to them. According to your history, you have applied to
Job ID 460224 : Customer Service Representative / Researcher in Frisco-TX like 31 others,
Job ID 741664 : Inbound Customer Service Call Center Reps Needed NOW! in Lewisville-TX like 26 others,
Job ID 1036760 : Front Office Administrative Staff in Plano-TX like 24 others in the past.

We suggested these jobs because people with an application history similar to yours also applied to them. According to your history, you have applied to
Job ID 779460 : Administrative Assistan
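
Two details matter when turning neighbor ratings into this text. Summing `Rating` mixes positive (1) and negative (-1) labels, so the total is a net score rather than a head count. Also, `isin` returns job rows in `jobs_df` order rather than in ranking order, so zipping them with the sorted counts can pair a job with another job's number. A hedged plain-Python sketch that counts only positive ratings and keeps each count attached to its job (toy data; `top_shared_jobs` is a hypothetical helper):

```python
from collections import Counter

def top_shared_jobs(neighbor_ratings, job_titles, top_n):
    """Rank jobs by how many neighbors applied (rating == 1) and attach titles."""
    applicants = Counter(job for _, job, rating in neighbor_ratings if rating == 1)
    # most_common keeps each count paired with its own job ID
    return [(job, job_titles.get(job, 'unknown title'), count)
            for job, count in applicants.most_common(top_n)]

# (user_index, job_id, rating) triples for one neighborhood; made-up values
toy_ratings = [
    (0, 900, 1), (1, 900, 1), (2, 900, 1),
    (0, 300, 1), (1, 300, 1), (2, 300, -1),
    (1, 500, 1),
]
toy_titles = {300: 'Customer Service', 500: 'Driver', 900: 'Office Assistant'}

top_jobs = top_shared_jobs(toy_ratings, toy_titles, top_n=2)
for job, title, count in top_jobs:
    print(f'Job ID {job} : {title} applied to by {count} similar users')
```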

### Section 4: Get full individual sample

In [52]:
expl_list_user = get_explanation_user(inst_k_neighbors_list)
expl_list_history = get_explanation_history(inst_k_neighbors_list, inst_k_neighbors_rating_list, 3)
for idx, expl_user, expl_hist in zip(range(len(expl_list_history)), expl_list_user, expl_list_history):
    print('Case:', idx)
    print('User profile:', expl_user['user_profile'])
    print('Explanation Type I:', expl_user['explanation'])
    print('Explanation Type II:', expl_hist['explanation'])

Case: 0
User profile: You are user ID 61035 who lives in Holbrook, AZ state. You have a High School specialized in General Studies with 9.0 years of experience. You have had 7 jobs in the past, and have never managed others before.

Explanation Type I: We suggested these jobs because you share similarities with people who also applied to these jobs. Out of 49, 15 people share your level of education (High School). On average they have 10.87 years of experience (vs. your 9.0 years). You and 34 others have never managed a team before.

Explanation Type II: We suggested these jobs because people with an application history similar to yours also applied to them. According to your history, you have applied to
Job ID 741469 : Customer Service Experience Wanted! in Scottsdale-AZ like 21 others,
Job ID 790763 : Manufacturing Manager in Lacombe-LA like 19 others,
Job ID 1038468 : Customer Service/Support in Phoenix-AZ like 17 others in the past.

Case: 1
User profile: You ar