# Background:

As a talent sourcing and management company, we find talented individuals and source them to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is a challenge in itself.

The nature of our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them by fitness.

We currently source a few candidates semi-automatically, so the sourcing part is not a concern at this time. We expect to first determine the best-matching candidates based on how fit they are for a given role. We generally run these searches with keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming we are able to list and rank fitting candidates, we then employ a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.
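
The rank-then-star loop described here can be sketched roughly as follows. This is only an illustrative sketch, not the company's actual pipeline: the bag-of-words cosine similarity, the helper names (`tokenize`, `cosine`, `rank`) and the sample titles are all assumptions made for the example. Starring is modeled by merging the starred candidate's title into the query, so candidates who resemble the starred one move up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and return its word counts (hypothetical helper)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(titles, keywords, starred=()):
    """Return candidate ids sorted by fit, best first.

    `titles` maps candidate id -> job title. `starred` holds the ids starred
    so far; their titles are merged into the query (an assumed, very simple
    way to use the supervisory signal).
    """
    query = tokenize(keywords)
    for cid in starred:
        query += tokenize(titles[cid])
    scores = {cid: cosine(query, tokenize(t)) for cid, t in titles.items()}
    return sorted(scores, key=scores.get, reverse=True)

# made-up sample titles, for illustration only
titles = {
    1: "Aspiring Human Resources Professional",
    2: "Human Resources Coordinator",
    3: "Seeking Human Resources Generalist Position",
    4: "Native English Teacher",
}
initial = rank(titles, "aspiring human resources")
reranked = rank(titles, "aspiring human resources", starred=[3])
```

In this toy example, candidate 3 starts in third place and moves to the top once it is starred, while the unrelated title stays last. A real pipeline would likely use richer text representations, but the calling pattern (rank, star, rank again) would stay the same.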

# Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connections: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

In [1]:
import pandas as pd
df = pd.read_csv('potential-talents.csv')
df

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,
101,102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,
102,103,Always set them up for Success,Greater Los Angeles Area,500+,


Job titles vary widely across candidates, although some are repeated (for example, 'Aspiring Human Resources Professional' appears more than once). First, I am going to check whether the dataset represents a specific sector, i.e. whether it is the result of the semi-automatic candidate sourcing mentioned in the brief, based on one or more specific keywords.

In [12]:
job_titles = df.job_title.unique()
job_titles

array(['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
       'Native English Teacher at EPIK (English Program in Korea)',
       'Aspiring Human Resources Professional',
       'People Development Coordinator at Ryan',
       'Advisory Board Member at Celal Bayar University',
       'Aspiring Human Resources Specialist',
       'Student at Humber College and Aspiring Human Resources Generalist',
       'HR Senior Specialist',
       'Seeking Human Resources HRIS and Generalist Positions',
       'Student at Chapman University',
       'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR',
       'Human Resources Coordinator at InterContinental Buckhead Atlanta',
       'Aspiring Human Resources Management student seeking an internship',
       'Seeking Human Resources Opportunities',
       'Experienced Retail Manager and aspiring Human Resources Professional',
       'H

In [30]:
# count how often each whitespace-separated token appears across the unique job titles
from collections import defaultdict
import operator

temp = defaultdict(int)
for sub in job_titles:
    for wrd in sub.split():
        temp[wrd] += 1
# list the 10 most frequent tokens (case-sensitive; punctuation such as '|' counts as a token)
sorted(temp.items(), key=operator.itemgetter(1), reverse=True)[:10]

[('Human', 33),
 ('Resources', 28),
 ('at', 22),
 ('and', 13),
 ('Aspiring', 10),
 ('|', 10),
 ('in', 6),
 ('Professional', 6),
 ('University', 6),
 ('Seeking', 6)]
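
The raw counts above treat 'Aspiring' and 'aspiring' as different tokens and include filler such as 'at', 'and' and '|'. A possible refinement, sketched here with `collections.Counter`, lowercases the titles and skips a small stop-word set. The stop-word set and the helper name `top_words` are assumptions chosen for illustration; in the notebook, `job_titles` would take the place of `sample_titles`.

```python
import re
from collections import Counter

# assumed stop-word set; extend as needed
STOP_WORDS = {"at", "and", "in", "of", "the", "an", "a", "for"}

def top_words(titles, n=10):
    """Return the n most common lowercase words, ignoring stop words and punctuation."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# a few titles taken from the output above, used as a stand-in for job_titles
sample_titles = [
    "Aspiring Human Resources Professional",
    "Experienced Retail Manager and aspiring Human Resources Professional",
    "Seeking Human Resources Opportunities",
    "People Development Coordinator at Ryan",
]
top = top_words(sample_titles, n=3)
```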

'Human' and 'Resources' are the most frequently occurring words in the dataset, which suggests the dataset is likely the result of a search on the keyword 'Human Resources'. Assuming we are interested in candidates in the human resources sector, we can filter the dataset to keep only candidates whose job title contains 'Human', 'Resources' or 'HR'. This gives a dataframe of candidates who currently work in HR and/or are interested in pursuing a career in HR.

In [90]:
# sanity check: list the job titles that are not HR-related
df_non_hr = df[~df['job_title'].str.contains('Human|Resources|HR')]
df_non_hr.job_title.unique()

array(['Native English Teacher at EPIK (English Program in Korea)',
       'People Development Coordinator at Ryan',
       'Advisory Board Member at Celal Bayar University',
       'Student at Chapman University',
       'Junior MES Engineer| Information Systems',
       'RRP Brand Portfolio Executive at JTI (Japan Tobacco International)',
       'Information Systems Specialist and Programmer with a love for data and organization.',
       'Bachelor of Science in Biology from Victoria University of Wellington',
       'Undergraduate Research Assistant at Styczynski Lab',
       'Lead Official at Western Illinois University',
       'Seeking employment opportunities within Customer Service or Patient Care',
       'Admissions Representative at Community medical center long beach',
       'Student at Westfield State University',
       'Student at Indiana University Kokomo - Business Management - \nRetail Manager at Delphi Hardware and Paint',
       'Student', 'Business Intelligence an

In [109]:
# filtering HR-related candidates into a single dataframe
# (.copy() avoids SettingWithCopyWarning when columns are added or modified later)
df_hr = df[df['job_title'].str.contains('Human|Resources|HR')].copy()
df_hr

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500+,
...,...,...,...,...,...
93,94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,
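
One caveat of the pattern `'Human|Resources|HR'` is that it is a case-sensitive substring match: 'HR' also matches inside longer tokens such as 'CHRO' or 'GPHR', and a lowercase 'human resources' would be missed. Whether that matters depends on the role being filled. A stricter variant, sketched below with the standard `re` module, matches whole words only and ignores case. The pattern and the helper name `is_hr_title` are assumptions for illustration.

```python
import re

# whole-word, case-insensitive match for "HR" or "human resources" (assumed pattern)
HR_PATTERN = re.compile(r"\b(?:hr|human\s+resources)\b", re.IGNORECASE)

def is_hr_title(title):
    """Return True if the title mentions HR or human resources as whole words."""
    return bool(HR_PATTERN.search(title))

examples = [
    "HR Senior Specialist",
    "aspiring human resources professional",
    "SVP, CHRO, Marketing & Communications",
    "Student at Chapman University",
]
flags = [is_hr_title(t) for t in examples]
```

In the notebook this could be applied with `df['job_title'].map(is_hr_title)`. Note the trade-off: titles that only carry HR abbreviations such as 'CHRO' would then be excluded, which may or may not be desirable.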


In [110]:
df_hr.dtypes

id              int64
job_title      object
location       object
connection     object
fit           float64
dtype: object

In [111]:
df_hr.location.unique()

array(['Houston, Texas', 'Raleigh-Durham, North Carolina Area',
       'Greater New York City Area', 'Kanada', 'San Francisco Bay Area',
       'Greater Philadelphia Area', 'Houston, Texas Area',
       'Atlanta, Georgia', 'Chicago, Illinois', 'Austin, Texas Area',
       'Jackson, Mississippi Area', 'Greater Grand Rapids, Michigan Area',
       'Virginia Beach, Virginia', 'Monroe, Louisiana Area',
       'Greater Boston Area', 'San Jose, California',
       'New York, New York', 'Dallas/Fort Worth Area',
       'Amerika Birleşik Devletleri', 'Baton Rouge, Louisiana Area',
       'Chattanooga, Tennessee Area', 'Los Angeles, California',
       'Highland, California', 'Milpitas, California',
       'Greater Atlanta Area', 'Kokomo, Indiana Area',
       'Las Vegas, Nevada Area', 'Cape Girardeau, Missouri'], dtype=object)

In [112]:
df_hr.connection.unique()

array(['85', '44', '1', '61', '500+ ', '390', '57', '82', '5', '7', '16',
       '212', '409', '455', '174', '268', '50', '18', '349', '415', '71',
       '48', '103'], dtype=object)

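We can replace every '500+ ' (note the trailing space) in 'connection' with 500 and convert the 'connection' column to integer type. For those candidates, 500 then acts as a lower bound rather than an exact count.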

In [113]:
#replacing 500+ with 500
df_hr = df_hr.replace("500+ ",500)
df_hr

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500,
...,...,...,...,...,...
93,94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,


In [114]:
#changing 'connection' dtype to int64
df_hr = df_hr.astype({'connection': 'int64'})
df_hr.dtypes

id              int64
job_title      object
location       object
connection      int64
fit           float64
dtype: object
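
Replacing '500+ ' with 500 loses the fact that the value is only a lower bound. If that distinction matters for ranking, one option is to keep a separate flag. The helper below is a hypothetical sketch (the name `parse_connection` and the returned tuple are assumptions, not part of the notebook). It also strips stray whitespace, so both '500+' and '500+ ' are handled.

```python
def parse_connection(raw):
    """Convert a raw connection value to (count, capped).

    `capped` is True when the source value had the form '500+', meaning
    the real count is at least `count`.
    """
    text = str(raw).strip()
    capped = text.endswith("+")
    count = int(text.rstrip("+"))
    return count, capped

parsed = [parse_connection(v) for v in ["85", "500+ ", "1"]]
```

To keep the flag, this would have to run on the raw strings, for example via `df_hr['connection'].map(parse_connection)` before the replacement step, since the '+' is gone afterwards.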

In [115]:
df_hr.job_title.unique()

array(['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
       'Aspiring Human Resources Professional',
       'Aspiring Human Resources Specialist',
       'Student at Humber College and Aspiring Human Resources Generalist',
       'HR Senior Specialist',
       'Seeking Human Resources HRIS and Generalist Positions',
       'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR',
       'Human Resources Coordinator at InterContinental Buckhead Atlanta',
       'Aspiring Human Resources Management student seeking an internship',
       'Seeking Human Resources Opportunities',
       'Experienced Retail Manager and aspiring Human Resources Professional',
       'Human Resources, Staffing and Recruiting Professional',
       'Human Resources Specialist at Luxottica',
       'Director of Human Resources North America, Groupe Beneteau',
       'Retired Army National Guard Recru

One job title seems out of place: 'Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621'. It reads like a staffing advertisement rather than a candidate's title, so I am going to remove it from the dataframe.

In [116]:
# removing the odd job title entry (the mask is built from df_hr itself so its index matches the frame being modified)
df_hr.drop(df_hr.loc[df_hr['job_title']=='Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!!  (408) 709-2621'].index, inplace=True)

In [117]:
# checking that it was removed
df_hr.job_title.unique()

array(['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
       'Aspiring Human Resources Professional',
       'Aspiring Human Resources Specialist',
       'Student at Humber College and Aspiring Human Resources Generalist',
       'HR Senior Specialist',
       'Seeking Human Resources HRIS and Generalist Positions',
       'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR',
       'Human Resources Coordinator at InterContinental Buckhead Atlanta',
       'Aspiring Human Resources Management student seeking an internship',
       'Seeking Human Resources Opportunities',
       'Experienced Retail Manager and aspiring Human Resources Professional',
       'Human Resources, Staffing and Recruiting Professional',
       'Human Resources Specialist at Luxottica',
       'Director of Human Resources North America, Groupe Beneteau',
       'Retired Army National Guard Recru

In [118]:
df_hr

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,
7,8,HR Senior Specialist,San Francisco Bay Area,500,
...,...,...,...,...,...
93,94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,


We can add an 'experience_status' column to distinguish where each candidate currently stands in the HR sector: 1 is assigned to candidates who have 'aspiring' or 'seeking' in their job title (presumably not yet working in HR), and 0 to everyone else.

In [131]:
df_hr['job_title'].str.contains('aspiring|seeking', case=False).astype(int)

0      1
2      1
5      1
6      1
7      0
      ..
93     1
96     1
98     1
99     1
100    0
Name: job_title, Length: 70, dtype: int32

In [None]:
# 1 if the title mentions 'aspiring' or 'seeking' (case-insensitive), else 0
df_hr['experience_status'] = df_hr['job_title'].str.contains('aspiring|seeking', case=False).astype(int)

In [126]:
df_hr['experience_status'] = [0 for row in df_hr['job_title'] if row.contains('aspiring|seeking', case=False) == True]
df_hr['experience_status'] = [1 for row in df_hr['job_title'] if row.contains('aspiring|seeking', case=False) == False]
df_hr

AttributeError: 'str' object has no attribute 'contains'
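
The error arises because iterating over `df_hr['job_title']` yields plain Python strings, and `.contains` exists only on the pandas `.str` accessor, not on `str`. In addition, a list comprehension with a trailing `if` filters rows out instead of choosing between two values, so the resulting list would not line up with the dataframe. A plain-Python sketch of the intended logic, using a conditional expression, is shown below. The helper name `experience_status` and the sample titles are illustrative, and the 1/0 convention follows the description of the column given earlier.

```python
def experience_status(title):
    """Return 1 if the title mentions 'aspiring' or 'seeking' (any case), else 0."""
    lowered = title.lower()
    return 1 if ("aspiring" in lowered or "seeking" in lowered) else 0

sample_titles = [
    "Aspiring Human Resources Professional",
    "HR Senior Specialist",
    "Seeking Human Resources Position",
]
# one value per title, so the result keeps the same length as the input
statuses = [experience_status(t) for t in sample_titles]
```

In the notebook, the same function could be applied row by row with `df_hr['job_title'].map(experience_status)`.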

In [122]:
if df_hr['job_title'].str.contains('aspiring|seeking', case=False) == True:
    df_hr['experience_status'] = 0
else:
    df_hr['experience_status'] = 1

df_hr


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
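
This `ValueError` occurs because `str.contains` returns a boolean Series with one value per row, while a Python `if` needs a single True or False. The usual alternative is to use the Series as a mask. The sketch below uses a small made-up dataframe (`demo`) rather than `df_hr`, and shows a mask-based equivalent of the if/else above, following the convention stated earlier (1 for aspiring/seeking titles, 0 otherwise).

```python
import pandas as pd

# small made-up frame standing in for df_hr
demo = pd.DataFrame({
    "job_title": [
        "Aspiring Human Resources Professional",
        "HR Senior Specialist",
        "Seeking Human Resources Position",
    ]
})

# boolean mask: True where the title mentions either word, ignoring case
mask = demo["job_title"].str.contains("aspiring|seeking", case=False)

# default every row to 0, then set the masked rows to 1
demo["experience_status"] = 0
demo.loc[mask, "experience_status"] = 1
```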