## Clustering to clean H1B Job Titles

The goal of this project is to replicate the clustering functionality of OpenRefine (formerly Google Refine). The idea is that in some data fields, free-text entries that differ in spelling, capitalization, or word order may really mean the same thing.

First, let's import the necessary packages.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

import sklearn.cluster
import distance

The data is taken from government records of applications for H-1B visas. You can find it here: https://nyu.app.box.com/s/9oz3qx886zpwwfm6ewj89pvjuee2eqp5. I saved it with a csv extension in Sublime.

First, let's load the data into a dataframe so we can steal the column we want.

In [3]:
df = pd.read_csv("H1B.csv")
df.dtypes

SUBMITTED_DATE           object
CASE_NO                  object
NAME                     object
ADDRESS                  object
ADDRESS2                 object
CITY                     object
STATE                    object
POSTAL_CODE              object
NBR_IMMIGRANTS            int64
BEGIN_DATE               object
END_DATE                 object
JOB_TITLE                object
DOL_DECISION_DATE        object
CERTIFIED_BEGIN_DATE     object
CERTIFIED_END_DATE       object
JOB_CODE                  int64
APPROVAL_STATUS          object
WAGE_RATE_1             float64
RATE_PER_1               object
MAX_RATE_1              float64
PART_TIME_1              object
CITY_1                   object
STATE_1                  object
PREVAILING_WAGE_1       float64
WAGE_SOURCE_1            object
YR_SOURCE_PUB_1         float64
OTHER_WAGE_SOURCE_1      object
WAGE_RATE_2             float64
RATE_PER_2               object
MAX_RATE_2              float64
PART_TIME_2              object
CITY_2  

We're going to be working with job titles. First, let's take a look at what our job titles look like so we can understand the problem. We'll group the dataframe by title, and then collect the first 300 unique titles into a NumPy array.

In [16]:
titles = df.groupby('JOB_TITLE')

In [17]:
jobs = []
counter = 0

#titles.groups is keyed by unique job title; keep the first 300
for group in titles.groups:
    if group not in jobs:
        jobs.append(group)
        counter += 1
    if counter >= 300:
        break

jobs_array = np.asarray(jobs)

In [18]:
jobs_array[:20]

array(['Software Engineer (Consultant)',
       'Software Engineer (Software Development Director)',
       'Assistant VP - Economist',
       'VICE PRESIDENT & CHIEF OPERATING OFFICER', 'PHYSICIAN RESIDENT',
       'Network Manager', 'IT Architect', 'BUSINESS DEVELOPMENT MANAGER ',
       'PostDoctoral Fellow', 'Adjunct Trainer',
       'PATENT SPECIALIST(Chemical Arts)', 'Staff Research Associate',
       'SR. FINANCIAL TECHNOLOGY ADVISOR', 'Software Project Engineer',
       'COMPUTER SUPPORT SPECIALIST', 'PGY 4 Medical Resident/Fellow ',
       'DENTAL OFFICE MANAGER AND DENTAL ASSISTANT', 'PROGAMMER ANALYST',
       'Web Applications Developer', 'Computer Systems Administrator'], 
      dtype='|S50')

As you can see if you scroll through the list, even once we've taken the unique titles out of the dataframe, many of them still overlap. Titles differ only in capitalization, word order, or spelling (for example, 'PROGAMMER ANALYST'). If we want to make this data useful and visualize it, we'll need to clean this up.

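First, we can change all of the titles to lowercase. There's an argument for keeping the punctuation, but we'll strip it to make things easier on the clustering later on. Note that str.strip only removes characters from the start and end of a string, so punctuation inside a title (as in 'accountant/bursar' in the output further down) stays where it is.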

In [19]:
#Convert all titles to lower case

for i in range(len(jobs_array)):
    jobs_array[i] = jobs_array[i].lower()

#Strip punctuation from the start and end of each title

for i in range(len(jobs_array)):
    jobs_array[i] = jobs_array[i].strip('/.,:;-–')
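
Because str.strip only trims the ends of each title, interior punctuation survives the cell above. If you wanted to remove it everywhere, something like the following sketch could work. The helper name normalize_title and the choice to replace punctuation with a space are assumptions for illustration, not part of the original notebook, and applying it would change the cluster output shown later.

```python
import re
import string

# Hypothetical helper (not from the notebook): matches any ASCII punctuation
# character plus the en dash that the original strip() call also targeted.
_PUNCT = re.compile("[" + re.escape(string.punctuation + "\u2013") + "]")

def normalize_title(title):
    """Lowercase a title, replace punctuation with spaces, collapse whitespace."""
    lowered = title.lower()
    no_punct = _PUNCT.sub(" ", lowered)
    # split()/join() removes leading, trailing, and repeated spaces
    return " ".join(no_punct.split())

print(normalize_title("PGY 4 Medical Resident/Fellow "))
print(normalize_title("SR. FINANCIAL TECHNOLOGY ADVISOR"))
```

Replacing punctuation with a space rather than deleting it keeps 'resident/fellow' from collapsing into a single word.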

Ok, now our data is ready for clustering.

One common way to measure how similar two strings are is the "Levenshtein Distance": the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Code borrowed from http://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups. More info on the distance formula here: https://rosettacode.org/wiki/Levenshtein_distance#Python.
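
To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. It is only an illustration: the notebook itself relies on the distance package, and the function below is a stand-in, not that package's implementation.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(levenshtein("programmer analyst", "progammer analyst"))  # 1 (one dropped letter)
print(levenshtein("kitten", "sitting"))                        # 3
```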

In [20]:
lev_similarity = -1 * np.array([[distance.levenshtein(j1,j2) for j1 in jobs_array] for j2 in jobs_array])

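So we've created a matrix (in array form) holding the negated Levenshtein Distance between every pair of job titles in the original jobs array. We negate the distances because the clustering step expects similarities, where a larger value means two titles are more alike. Here's what it looks like: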

In [21]:
lev_similarity

array([[  0, -25, -23, ..., -19, -18, -28],
       [-25,   0, -40, ..., -38, -36, -37],
       [-23, -40,   0, ..., -17, -18, -22],
       ..., 
       [-19, -38, -17, ...,   0, -12, -24],
       [-18, -36, -18, ..., -12,   0, -25],
       [-28, -37, -22, ..., -24, -25,   0]])
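
The distance from one title to another is the same in both directions, which is why the matrix above is symmetric with zeros on the diagonal. That means only half of the pairs actually need to be computed. A rough sketch of that idea follows. The function name similarity_matrix is hypothetical, and the length-difference metric is just a placeholder for distance.levenshtein so the example runs without extra packages.

```python
import numpy as np

def similarity_matrix(items, dist):
    """Negated pairwise distances, computing each unordered pair only once.

    Assumes dist(a, b) is symmetric and returns an integer.
    """
    n = len(items)
    sim = np.zeros((n, n), dtype=int)  # diagonal stays 0: distance to itself
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(items[i], items[j])
            sim[i, j] = -d
            sim[j, i] = -d  # mirror into the lower triangle
    return sim

# Placeholder metric (difference in length), standing in for distance.levenshtein.
def toy_dist(a, b):
    return abs(len(a) - len(b))

print(similarity_matrix(["it architect", "network manager", "actuary"], toy_dist))
```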

Now we'll cluster these values. Affinity Propagation seems to be the right algorithm for the job, since it can work directly from a precomputed similarity matrix like the one we just built. The algorithm was first proposed here: http://science.sciencemag.org/content/315/5814/972.

Affinity Propagation seems similar to K-Means, but instead of picking centroids and then repeatedly reassigning points to them, the algorithm passes messages between pairs of data points until a set of exemplars emerges. The exemplars are actual data points. K-Means, on the other hand, starts from randomly chosen centroids (not the case in AP) and then figures out which points are closest to each one. Info from here: http://www.psi.toronto.edu/affinitypropagation/faq.html.
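
For intuition about the message passing, here is a simplified NumPy sketch of the two messages AP exchanges: "responsibility" (how well suited point k is to be the exemplar for point i) and "availability" (how appropriate it is for i to choose k). It is a toy version under stated assumptions: it runs a fixed number of iterations with no convergence check, and the function name and the 1-D toy data are made up for illustration. The notebook itself uses sklearn.cluster.AffinityPropagation.

```python
import numpy as np

def affinity_propagation(S, preference, damping=0.5, n_iter=200):
    """Toy message-passing loop; returns (exemplar_indices, labels)."""
    S = np.array(S, dtype=float)         # copy so the caller's matrix is untouched
    n = S.shape[0]
    np.fill_diagonal(S, preference)      # self-similarity is the preference
    R = np.zeros((n, n))                 # responsibilities r(i, k)
    A = np.zeros((n, n))                 # availabilities a(i, k)
    rows = np.arange(n)

    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i, k})
        # a(k,k) = sum of positive r(i',k), i' != k
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when a(k,k) + r(k,k) is positive.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if len(exemplars) == 0:
        return exemplars, np.full(n, -1)
    # Every other point joins the exemplar it is most similar to.
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(len(exemplars))
    return exemplars, labels

# Toy data: two well-separated groups on a line, similarity = negated distance.
points = np.array([0, 1, 2, 10, 11, 12])
S = -np.abs(points[:, None] - points[None, :])
exemplars, labels = affinity_propagation(S, preference=-3)
print(exemplars, labels)
```

On this toy data the middle point of each group should end up as its exemplar, since it is the closest on average to its neighbours.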

Another super important (and helpful) feature of Affinity Propagation is that we don't need to specify the number of centroids / exemplars, which is key for our data set, since we don't know in advance how many distinct jobs are hiding in the titles.

In [22]:
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping = 0.5)
affprop.fit(lev_similarity)

AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
          damping=0.5, max_iter=200, preference=None, verbose=False)

Now that we've fit the model, let's collect all of the clusters into a dictionary keyed by exemplar.

In [23]:
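def orderClusters(array):
    #Build {exemplar title: array of titles in its cluster}
    #Note: reads the fitted model from the global variable affprop
    
    clusters = {}
    
    for cluster_id in np.unique(affprop.labels_):

        #The exemplar is the data point chosen to represent this cluster
        exemplar = array[affprop.cluster_centers_indices_[cluster_id]]

        #All unique titles assigned to this cluster
        cluster = np.unique(array[np.nonzero(affprop.labels_==cluster_id)])

        #If two exemplars are the same string, keep the first cluster
        if exemplar not in clusters:
            clusters[exemplar] = cluster
            
    return clusters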
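
Since orderClusters reads the fitted model from the global affprop variable, it is easy to lose track of which fit it is describing once the model is refit with different settings. One possible alternative is to pass the labels and exemplar indices in explicitly. The function name group_by_exemplar and the toy inputs below are assumptions for illustration only.

```python
import numpy as np

def group_by_exemplar(items, labels, center_indices):
    """Map each exemplar string to the sorted unique members of its cluster."""
    items = np.asarray(items)
    labels = np.asarray(labels)
    clusters = {}
    for cluster_id, center in enumerate(center_indices):
        members = np.unique(items[labels == cluster_id])
        # setdefault keeps the first cluster if two exemplars share a string
        clusters.setdefault(str(items[center]), members.tolist())
    return clusters

toy_titles = ["account executive", "account exexcutive", "it architect", "architect"]
toy_labels = [0, 0, 1, 1]   # made-up cluster assignments
toy_centers = [0, 3]        # made-up exemplar positions
print(group_by_exemplar(toy_titles, toy_labels, toy_centers))
```

With a fitted model, the call would look like group_by_exemplar(jobs_array, affprop.labels_, affprop.cluster_centers_indices_).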

In [24]:
clusters = orderClusters(jobs_array)
clusters

{'account exexcutive': array(['account executive i', 'account exexcutive', 'accountant/bursar'], 
       dtype='|S50'),
 'adminidtrative manager': array(['adminidtrative manager', 'administrative assistant',
        'administrative officer', 'administrator-orthodontics lab manager',
        'document control manager', 'industrial production manager',
        'systems adminstrator/e-commerce manager'], 
       dtype='|S50'),
 'application engineer': array(['application developer', 'application engineer',
        'application engineer i', 'asic design engineer',
        'associate applications engineer',
        'electrical & electronics engineer', 'electrical engineer',
        'immplemetation programmer', 'operations engineer',
        'public health engineer ii', 'site patrol implementation engineer',
        'specifications engineer', 'supply chain application engineer'], 
       dtype='|S50'),
 'architect ': array(['actuary', 'adjunct instructor', 'ag tech iii',
        'archictectu

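Definitely not bad for a first run-through! Starting from the top, a lot of these make sense. The first key is 'account exexcutive' (itself a misspelling of 'account executive'), and its cluster includes 'account executive i', which is exactly the kind of match we want. It also pulls in 'accountant/bursar', which is close in spelling but is really a different job.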

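If we adjust the "preference" argument, we can push the algorithm toward more clusters. The preference is the self-similarity of each point, i.e. how willing it is to be an exemplar; when it is left as None, sklearn uses the median of the input similarities. To see how changing the preference changes the number of clusters, we'll run it a few different times and check the length of the dictionary (i.e. the number of exemplars).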

In [25]:
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping = 0.5, preference = -10)
affprop.fit(lev_similarity)
clusters = orderClusters(jobs_array)
len(clusters)

179

In [26]:
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping = 0.5, preference = -5)
affprop.fit(lev_similarity)
clusters = orderClusters(jobs_array)
len(clusters)

269

We're getting close to the total number of data points that we grabbed (300), so we can see that as the preference approaches 0, nearly every title becomes its own exemplar and we approach no clustering at all.
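
Rather than refitting by hand for each value, the same experiment can be written as a loop. The sketch below uses a small made-up similarity matrix in place of lev_similarity so that it runs on its own. The helper name count_clusters and the preference values are illustrative assumptions.

```python
import numpy as np
import sklearn.cluster

def count_clusters(similarity, preference, damping=0.5):
    """Fit Affinity Propagation and return the number of exemplars found."""
    model = sklearn.cluster.AffinityPropagation(
        affinity="precomputed", damping=damping, preference=preference)
    model.fit(similarity)
    return len(model.cluster_centers_indices_)

# Toy stand-in for lev_similarity: negated distances between points on a line.
points = np.array([0, 1, 2, 10, 11, 12])
toy_similarity = -np.abs(points[:, None] - points[None, :]).astype(float)

# A lower (more negative) preference makes points less willing to be exemplars.
for preference in (-3, 0):
    print(preference, count_clusters(toy_similarity, preference))
```

On the real data you would pass lev_similarity and a wider range of preferences, then plot the number of clusters against the preference.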

So how the heck are we supposed to know how many clusters are correct, or in other words, what preference to use? Well, that's a great question, and the subject of this research paper on arXiv: https://arxiv.org/abs/0805.1096. The basic idea is to keep adjusting and re-running until you converge on the right number of clusters. It's called "Adaptive Affinity Propagation."

The question is: how can we define a measure for the "right" number of clusters?