## Clustering to clean H1B Job Titles

The goal of this project is to try to replicate the clustering functionality of OpenRefine (formerly Google Refine). The idea is that in some data fields, free-text entries that are spelled or formatted differently may really mean the same thing.

First, let's import the necessary packages.

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

import sklearn.cluster
import distance

The data is taken from governmental records of applications for H1B visas. You can find it here: https://nyu.app.box.com/s/9oz3qx886zpwwfm6ewj89pvjuee2eqp5. I saved it with a csv extension in Sublime.

First, let's load the data into a dataframe so we can steal the column we want.

In [8]:
df = pd.read_csv("H1B.csv")
df.dtypes

SUBMITTED_DATE           object
CASE_NO                  object
NAME                     object
ADDRESS                  object
ADDRESS2                 object
CITY                     object
STATE                    object
POSTAL_CODE              object
NBR_IMMIGRANTS            int64
BEGIN_DATE               object
END_DATE                 object
JOB_TITLE                object
DOL_DECISION_DATE        object
CERTIFIED_BEGIN_DATE     object
CERTIFIED_END_DATE       object
JOB_CODE                  int64
APPROVAL_STATUS          object
WAGE_RATE_1             float64
RATE_PER_1               object
MAX_RATE_1              float64
PART_TIME_1              object
CITY_1                   object
STATE_1                  object
PREVAILING_WAGE_1       float64
WAGE_SOURCE_1            object
YR_SOURCE_PUB_1         float64
OTHER_WAGE_SOURCE_1      object
WAGE_RATE_2             float64
RATE_PER_2               object
MAX_RATE_2              float64
PART_TIME_2              object
CITY_2  

We're going to be working with job titles. We'll group the dataframe by title, and then collect the first 200 unique titles into a list to pass into a clustering algorithm.

In [38]:
# Group by title; the group keys are the unique job titles.
titles = df.groupby('JOB_TITLE')

# Keep only the first 200 unique titles so the pairwise distance
# matrix computed below (200 x 200) stays small.
jobs = list(titles.groups)[:200]

jobs_array = np.asarray(jobs)

As you can see if you scroll through the list, even once we've taken the unique titles out of the dataframe, there are tons of overlapping positions. There are lower case and upper case, words switched around, misspellings, etc. If we want to make this data useful and visualize it, we'll need to clean this up.
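Some of that mess (case, stray whitespace, punctuation, word order) can be collapsed before any distance-based clustering. The sketch below is loosely modeled on the "fingerprint" keying method that OpenRefine documents for key-collision clustering. It is not part of this notebook's pipeline: `fingerprint` is a hypothetical helper, it only handles ASCII punctuation, and the reordered title in the sample is made up for illustration.

```python
import string
from collections import defaultdict

# Translation table that turns every ASCII punctuation character into a space,
# so "Accountant/Bursar" splits into two tokens.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def fingerprint(title):
    """Return a normalized key: lowercase, no punctuation, unique tokens sorted."""
    cleaned = title.strip().lower().translate(_PUNCT_TO_SPACE)
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

# Three titles that show up in this data, plus one reordered variant
# ("Engineer, Application") invented for illustration.
sample = ["ARCHITECT", "Architect ", "Application Engineer", "Engineer, Application"]

# Titles that share a key are treated as the same job.
groups = defaultdict(list)
for title in sample:
    groups[fingerprint(title)].append(title)
```

This only catches variants that normalize to exactly the same key. Misspellings such as 'Account Exexcutive' still need a distance-based approach like the one that follows.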

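One common way to measure how different two strings are is the "Levenshtein Distance": the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Code borrowed from http://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups.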
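To make the metric concrete, here is a minimal pure-Python sketch of the standard dynamic-programming computation. It is only an illustration: the notebook itself relies on `distance.levenshtein`, and the function name `levenshtein` below is my own.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    # previous[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

typo_distance = levenshtein("Account Executive", "Account Exexcutive")  # one extra "x"
case_distance = levenshtein("ARCHITECT", "Architect ")  # case and trailing space count
```

Note that the comparison is case-sensitive: the typo is a single edit away, while 'ARCHITECT' and 'Architect ' are nine edits apart even though they are obviously the same job.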

In [47]:
lev_similarity = -1 * np.array([[distance.levenshtein(j1,j2) for j1 in jobs_array] for j2 in jobs_array])

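So we've created a matrix (in array form) of the negated Levenshtein Distance between every pair of job titles in the original jobs array. The distances are multiplied by -1 because the clustering step below expects similarities, where a larger value means a closer match. Here's what it looks like: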

In [48]:
lev_similarity

array([[  0, -25, -25, ..., -24, -24, -26],
       [-25,   0, -42, ..., -37, -39, -42],
       [-25, -42,   0, ..., -24, -19, -19],
       ..., 
       [-24, -37, -24, ...,   0, -23, -24],
       [-24, -39, -19, ..., -23,   0, -16],
       [-26, -42, -19, ..., -24, -16,   0]])
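A quick way to sanity-check a matrix like this is to build it for a handful of titles and look up each title's closest neighbour. The sketch below does that for four titles from this data set. The `lev` helper is a stand-in for `distance.levenshtein` so the example runs on its own, and `toy_titles`, `toy_similarity`, and `nearest` are names made up for this illustration.

```python
import numpy as np

def lev(a, b):
    """Plain dynamic-programming Levenshtein distance (stand-in for distance.levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), cur[j - 1] + 1, prev[j] + 1))
        prev = cur
    return prev[-1]

toy_titles = np.array(["Account Executive I", "Account Exexcutive",
                       "Application Engineer", "Application Engineer I"])

# Same construction as lev_similarity: negated pairwise distances.
toy_similarity = -1 * np.array([[lev(t1, t2) for t1 in toy_titles] for t2 in toy_titles])

# Mask the diagonal (every title is identical to itself) before looking
# for each title's closest *other* title.
masked = toy_similarity.astype(float)
np.fill_diagonal(masked, -np.inf)
nearest = toy_titles[masked.argmax(axis=1)]
```

The matrix is symmetric with zeros on the diagonal, and it needs one distance computation per pair of titles, which is why the notebook caps the list at 200 entries.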

Now we'll cluster these values. Affinity Propagation seems to be the right algorithm for the job, since it can work directly from a precomputed similarity matrix, and we've already calculated the Levenshtein Distances for our jobs array.

Affinity Propagation seems similar to K-Means in that both group points around representative centers, but they get there differently. K-Means chooses random starting centroids and then alternates between assigning each point to its closest centroid and recomputing the centroids. Affinity Propagation instead sends messages between pairs of data points to figure out what's close and what's not, until a set of exemplars emerges. The exemplars are actual data points (here, actual job titles) rather than computed averages. I've learned the small amount I know from here: http://www.psi.toronto.edu/affinitypropagation/faq.html.

Another super important (and helpful) feature of Affinity Propagation is that we don't need to specify the number of centroids / exemplars up front, which is key for our data set, since we have no idea in advance how many distinct jobs are hiding in it.
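The message passing can be sketched in a few lines of NumPy. This is a toy version under stated assumptions, not scikit-learn's implementation: it runs a fixed number of iterations and skips the convergence checks and tie handling a library would do. The function name `affinity_propagation_sketch` and the six-point example are made up for illustration.

```python
import numpy as np

def affinity_propagation_sketch(similarity, preference, damping=0.5, n_iter=200):
    """Toy Affinity Propagation: returns (exemplar_indices, labels)."""
    S = np.array(similarity, dtype=float)
    n = S.shape[0]
    np.fill_diagonal(S, preference)   # self-similarity = preference
    R = np.zeros((n, n))              # responsibility r(i,k): how well k suits i as exemplar
    A = np.zeros((n, n))              # availability a(i,k): how appropriate it is for i to pick k
    rows = np.arange(n)

    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k})
        # a(k,k) = sum of positive r(i',k) for i' != k
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when its self-availability plus self-responsibility is positive.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if exemplars.size == 0:
        raise RuntimeError("no exemplars emerged; try a different preference or damping")
    labels = S[:, exemplars].argmax(axis=1)      # each point joins its most similar exemplar
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Six points on a line forming two obvious groups. The similarity is the negated
# squared gap, playing the same role as the negated Levenshtein distance here.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_similarity = -(points[:, None] - points[None, :]) ** 2

exemplars, labels = affinity_propagation_sketch(
    toy_similarity, preference=np.median(toy_similarity))
```

Nothing in the call says how many clusters to find. The two groups fall out of the similarities and the preference alone, with the middle point of each group ending up as its exemplar.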

In [100]:
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping = 0.5)
affprop.fit(lev_similarity)

AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
          damping=0.5, max_iter=200, preference=None, verbose=False)

Now that we've fit the model, let's collect all of the clusters into a dictionary keyed by exemplar.

In [139]:
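def orderClusters(array):
    # Build {exemplar title: array of member titles} from the fitted
    # global `affprop` model.
    clusters = {}

    for cluster_id in np.unique(affprop.labels_):
        # The exemplar is the data point chosen to represent this cluster.
        exemplar = array[affprop.cluster_centers_indices_[cluster_id]]

        # All titles whose label matches this cluster.
        cluster = np.unique(array[np.nonzero(affprop.labels_==cluster_id)])

        if exemplar not in clusters:
            clusters[exemplar] = cluster

    return clusters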

In [140]:
clusters = orderClusters(jobs_array)
clusters

{'Account Exexcutive': array(['Account Executive I', 'Account Exexcutive', 'Accountant/Bursar',
        'International Sales Account Executive', 'Primary Care Provider'], 
       dtype='|S50'),
 'Administrator-Orthodontics Lab Manager': array(['Administrator-Orthodontics Lab Manager'], 
       dtype='|S50'),
 'Application Engineer': array(['Adjunct Trainer', 'Application Engineer', 'Application Engineer I',
        'Associate Applications Engineer', 'Chief Sales Engineer',
        'Mathematician and Math Modeler',
        'Public Relations & Legal Coordinator',
        'Receptionist/Information clerk',
        'Site Patrol Implementation Engineer', 'immplemetation programmer'], 
       dtype='|S50'),
 'Architect ': array(['ARCHITECT', 'Ag Tech III', 'Aircraft Upholsterer', 'Architect ',
        'Auditor', 'Black Belt', 'Botanist', 'IT Architect',
        'News Assistant II', 'Pharmacist', 'Phlebotomist',
        'Quality Inspector', 'Rearch Fellow', 'Reporter',
        'Specialist II-I

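Definitely not bad for a first run through! Starting from the top, a lot of these make sense. The first key is 'Account Exexcutive' (a misspelling of 'Account Executive' that the algorithm happened to pick as the exemplar), and the cluster includes values like 'Account Executive I' and 'Accountant/Bursar', which are indeed similar in spelling.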

But there are also a few pretty bad ones. 'Primary Care Provider' certainly shouldn't be in the 'Account Executive' cluster, and should probably be its own exemplar.

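If we adjust the "preference" argument, we can force the algorithm to employ more clusters. The preference is the self-similarity each point is given; when it is left as None, scikit-learn sets it to the median of the input similarities, and moving it closer to 0 makes every title more willing to act as its own exemplar. To see how changing the preference changes the number of clusters, we'll run it a few different times and check the length of the dictionary (i.e. the number of exemplars).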

In [165]:
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping = 0.5, preference = -10)
affprop.fit(lev_similarity)
clusters = orderClusters(jobs_array)
len(clusters)

139

In [174]:
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping = 0.5, preference = -5)
affprop.fit(lev_similarity)
clusters = orderClusters(jobs_array)
len(clusters)

192

With a preference of -5 we get 192 clusters, which is close to the 200 titles we grabbed. So as the preference approaches 0, almost every title becomes its own exemplar, and we approach no clustering at all.
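Once a preference value gives clusters that look sensible, the dictionary can be inverted into a lookup table that maps every raw title to its exemplar, which is the form needed to actually clean the column. The sketch below assumes a dictionary shaped like the one `orderClusters` returns. `build_lookup` is a hypothetical helper, and `example_clusters` is a hand-written stand-in abbreviated from the output shown earlier.

```python
def build_lookup(clusters):
    """Map every member title to the exemplar of its cluster."""
    lookup = {}
    for exemplar, members in clusters.items():
        for title in members:
            # str() turns NumPy string scalars into plain Python strings.
            lookup[str(title)] = str(exemplar)
    return lookup

# Hand-written stand-in for the dictionary returned by orderClusters.
example_clusters = {
    "Account Exexcutive": ["Account Executive I", "Account Exexcutive"],
    "Application Engineer": ["Application Engineer", "Application Engineer I"],
}
lookup = build_lookup(example_clusters)
```

With pandas, something like `df['JOB_TITLE'].map(lookup)` would then produce a cleaned column, with NaN for any title that was not among the clustered ones. Since an exemplar can itself be a misspelling (as with 'Account Exexcutive'), it is probably worth renaming the keys by hand before applying the lookup.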