# Choosing the Ideal Job Type for an Applicant

### Using clustering analysis to find the right fit

In the previous project, we tried to build a best-fit model to predict whether a worker placed by a recruiter would succeed. In this project, we look at whether we can predict the best type of work (office or manual) for a new recruit, to improve the chances of a successful placement.

### 1. Pull in and pre-process data

In [2]:
# Import libraries
import numpy as np
import pandas as pd

In [3]:
# Read in the worker data from the 'Client Information' sheet
# (pd.ExcelFile takes no 'encoding' argument in current pandas, so it is omitted)
xls_file = pd.ExcelFile("Origami_Data.xlsx")
worker_data = xls_file.parse('Client Information')
print("worker data read successfully!")

worker data read successfully!


In [13]:
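# Pull the free-text employer comments (already a pandas Series)
comments = worker_data["Comments from the employer"]

# Join every comment into one space-separated string for a quick read-through
comments.str.cat(sep=' ')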

u"Good worker. Willing to work extra hours. Seems to have working knowledge but hard to keep them at the office Like the process but definitely not a good fit with this candidate Was late to work  You need better screening before sending us people Late to work Once we found a good fit for them, they have been doing well Thank you for finding someone. Simply didn\u2019t work out. Works a lot, promotion is going to be given Employee retired Hard to balance their schedule with health needs Fantastic worker Happy so far with the placement Health issues on the job. Not a good fit. Manual labor was too tiring.  Error rate was unnaceptable Many unscheduled absences No reliable mode of transportation Your process made this very difficult Thank you  Thank you for finding a good fit such short notice Have enjoyed his work immensely Thank you for continued partnership Happy so far with the placement thx Not impressed with you or employee simply didn't work. She quit 2 weeks in We get a lot of peo

In [19]:
# Count the workers placed in each job type
n_office = (worker_data['OFFICE/MANUAL'] == 'OFFICE').sum()
n_manual = (worker_data['OFFICE/MANUAL'] == 'MANUAL').sum()

print("Number of workers in the office field: {}".format(n_office))
print("Number of workers in the manual labor field: {}".format(n_manual))

Number of workers in the office field: 93
Number of workers in the manual labor field: 96


#### Clean up values
First, clean up the inconsistent values in the `State` and `Gender` columns.

In [20]:
# Normalize state names to lower-case two-letter codes and keep only rows with gender M or F
worker_data['State'] = worker_data['State'].str.lower()
# Restrict the replacement to the State column so other text columns are left untouched
worker_data['State'] = worker_data['State'].replace({'alabama':'al','florida':'fl','georgia':'ga','south carolina':'sc','louisiana':'la'}, regex=True)
states = ['al','fl','ga','la','sc']
worker_data = worker_data.loc[worker_data['State'].isin(states)]
gender = ['M','F']
worker_data = worker_data.loc[worker_data['Gender'].isin(gender)]

# Remove NaN
worker_data2 = worker_data.dropna(axis = 0, how = 'any', subset = ['Employed In Past 6 Months','Gender','Age','State','Education Level'])


In [21]:
# Extract the feature (X) and target (y) columns, leaving out the ID and Comments columns
feature_cols = ['Employed In Past 6 Months','Age','OFFICE/MANUAL','State','Education Level','Gender']
target_col = ['Placement Successful']


x_all = worker_data2[feature_cols]
y_all = worker_data2[target_col]

#### Preprocess feature columns

A few of the feature columns are non-numeric and need to be converted. One of them, `'Employed In Past 6 Months'`, simply holds `yes`/`no` values, which can reasonably be converted into `1`/`0` (binary) values.
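As a minimal sketch of the yes/no conversion, the snippet below maps the two labels onto `1`/`0` on a small toy frame. The toy values are invented for illustration, and the lower-case spelling `yes`/`no` is an assumption; the labels in the actual spreadsheet may be cased differently.

```python
import pandas as pd

# Toy frame: illustrative values only, not taken from Origami_Data.xlsx
toy = pd.DataFrame({'Employed In Past 6 Months': ['yes', 'no', 'yes']})

# Map yes/no onto 1/0; any label missing from the mapping becomes NaN
toy['Employed In Past 6 Months'] = toy['Employed In Past 6 Months'].map({'yes': 1, 'no': 0})

print(toy['Employed In Past 6 Months'].tolist())
```

Because unmapped labels turn into `NaN`, checking the column for missing values afterwards is a cheap way to spot unexpected spellings.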

Other columns, like `State` and `Education Level`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create one new column per possible value (e.g. `al`, `ga`, `fl`, etc.) and, for each row, assign a `1` to the column that matches its value and a `0` to all the others.

These generated columns are called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to create them.
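To make the idea concrete, here is a small sketch of `pandas.get_dummies()` applied to a toy frame. The rows are invented for illustration and are not taken from the worker data; the `State_` prefix is just one reasonable naming choice.

```python
import pandas as pd

# Toy frame: illustrative values only
toy = pd.DataFrame({
    'Age': [25, 40, 31],
    'State': ['al', 'ga', 'fl'],
})

# Replace the categorical State column with one indicator column per state
encoded = pd.get_dummies(toy, columns=['State'], prefix='State')

print(sorted(encoded.columns))

# Each row is flagged in exactly one of the generated State_* columns
state_cols = [c for c in encoded.columns if c.startswith('State_')]
print((encoded[state_cols].sum(axis=1) == 1).all())
```

Numeric columns such as `Age` pass through unchanged. Depending on the pandas version, the indicator columns may be stored as integers or booleans; either way they behave as `1`/`0` in arithmetic.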