# Investigating Similar Violation Patterns

Let's consider a hypothetical scenario where a WHD investigator has just finished investigating a firm classified in the *Breakfast Cereal Manufacturing Facility* group under the *6-digit NAICS* code **311230**, in the *Santa Rosa, CA* Metropolitan Statistical Area (MSA). The firm was found to have committed certain types of violations that are concerning and that affect at-risk populations.

Given this scenario, and with only the main DOL WHD historical investigations dataset at one's disposal, how might the DOL find similar patterns of violations? The obvious way is to look at *similar* firms under the same industry code. But what if very similar violations were occurring in some other industry, in another region? There would be no way of knowing unless there were some sort of *anecdotal evidence*. Well, that's until now:

Let's consider building a big vector space where each vector is a NAICS6 + MSA pair (a specific industry sub-group in an MSA), described by the violation *severity* metrics (calculated in the previous section) along with some other specific attributes. By computing an appropriate *distance* between every pair of vectors in this multi-dimensional space, we get a sense of how *similar* each vector is to every other vector. In essence, small distances translate into similar violation patterns between industry groups across MSAs in the *known* dataset.

Great, this is progress. But this doesn't quite solve the problem of *observation bias* in the dataset - the investigator is able to understand similarity in violation patterns amongst known violators in known industry sub-groups. In our example, this means all other known 6-digit NAICS violators. **But what about the unknown industry sub-groups out there?** If we make certain assumptions on demographics, class, sex, education, income status, household, it's entirely possible that the resulting class of workforce **is** encountering similar violations in some other industry that's not in the dataset.  

Here's one little trick on how we might discover other violators. The trick lies in building our vector space *up* in the industry hierarchy, for example, at the 3-digit NAICS level. Here, *Breakfast Cereal Manufacturing Facility* is grouped into *Food Manufacturing* under the 3-digit code: 311. The following is key for the exercise below:  

> We build the similarity space on the assumption that a higher hierarchy of the NAICS group captures certain characteristics of workforce populations within that group.  

This way, if we're able to identify target sub-groups of a *similar* violator at the 3-digit NAICS, we can drill down to sub-groups that are not in the dataset to watch for similar violation patterns.
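
To make the roll-up concrete, here is a minimal sketch of moving between the 6-digit and 3-digit levels. It relies only on the fact that the first three digits of a NAICS code identify its subsector. Apart from 311230, the codes in the list are illustrative placeholders rather than values taken from the WHD dataset.

```python
import pandas as pd

# Illustrative 6-digit NAICS codes; only 311230 comes from the scenario above.
codes = pd.Series(["311230", "311211", "311812", "722511"], name="NAICS6")

# The first three digits of a NAICS code identify its subsector.
naics3 = codes.str[:3]

# Sub-groups that share a subsector with the investigated firm (311230).
source_code = "311230"
target_subsector = source_code[:3]
siblings = codes[(naics3 == target_subsector) & (codes != source_code)].tolist()

print(target_subsector, siblings)
```

In the same spirit, once a *similar* 3-digit group has been identified, its 6-digit members are the candidates to drill into.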

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbrn
import scipy.cluster.vq as scp_vq
from scipy.spatial.distance import cdist
from sklearn.cluster import k_means
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import scale
%matplotlib inline

In [2]:
#read in dataset prepared in part 2
dat = pd.read_csv('./../data/california_violtns_MSA_NAICS3.csv')

In [5]:
pd.set_option('display.max_columns',50)
dat.sample(3)

Unnamed: 0.1,Unnamed: 0,MSA,MSADSCR,NAICS3,NAICS3Desc,MinWage_ATPAmt,BMW_Cases,cmp_assd_cnt,ESTB,MinWage_Cases,case_violtn_cnt,is_violator,ee_atp_cnt,All_ATPAmt,ChildLabor_EmpAff,ee_violtd_cnt,PAYR_N,MinWage_EmpAff,BMW_ATPAmt,BMW_EmpAff,EMPL_N,Other_ATPAmt,Other_Cases,Other_EmpAff,ChildLabor_ATPAmt,ChildLabor_Cases,All_AtpAmt,num_investigations,km_cluster,MW_Vltn_Svrty,BMW_Vltn_Svrty,CL_Vltn_Svrty,OTHER_Vltn_Svrty,OVERALL_Vltn_Svrty
861,861,47300.0,"Visalia-Porterville, CA Metro Area",713,"Amusement, Gambling, and Recreation Industries",6448.39,0,0.0,37.0,16,16,2,9,19345.17,0,16,24865.0,9,0.0,0,1272.0,12896.78,0,7,0.0,0,19345.17,3,0,45.625401,0.0,0.0,70.972846,243.3355
22,22,12540.0,"Bakersfield, CA Metro Area",485,Transit and Ground Passenger Transportation,0.0,0,0.0,14.0,0,2,1,1,2186.08,0,1,0.0,0,0.0,0,0.0,2186.08,2,1,0.0,0,2186.08,1,0,,,,inf,inf
765,765,44700.0,"Stockton-Lodi, CA Metro Area",326,Plastics and Rubber Products Manufacturing,52604.31,0,0.0,30.0,85,85,1,84,157812.93,0,84,45871.0,84,0.0,0,1156.0,105208.62,0,0,0.0,0,157812.93,1,0,3822.458512,0.0,0.0,0.0,11467.38


We want to carefully consider our vector attributes to make sure we're not including redundant attributes. The objective is to maximize the information that each feature adds in defining the vector. Since we know this domain inside-out by now, let's define the features we want to include to define our vector. *Recall, each vector represents the violation characteristics of a 3 digit NAICS industry group in a Metropolitan Statistical Area.*

In [7]:
vector_defn = ['MW_Vltn_Svrty','BMW_Vltn_Svrty','CL_Vltn_Svrty','OTHER_Vltn_Svrty','OVERALL_Vltn_Svrty','PAYR_N']

The *severity* metrics created for each violation type in the previous section account for most of the information points, including employees affected as a percentage of the workforce in that MSA and back wages owed. It is severity that we're concerned with. I also throw in the total NAICS payroll in the MSA (`PAYR_N`) as an additional feature, to help the *similarity* generalize to industries that might employ similar sorts of populations (and that might be flying under the radar with violation patterns observed in other places).

In [16]:
#extract matrix of features
dat_sub = dat[vector_defn]

In [20]:
dat_sub.shape

(884, 6)

Some of our Severity metrics have got a NaN either because there were no violations discovered, or because of erroneous MSA numbers. Let's drop these.

In [23]:
#reassign instead of using inplace=True: dat_sub is a slice of dat, and dropping in place triggers a SettingWithCopyWarning
dat_sub = dat_sub.dropna(axis=0, how='any')


In [24]:
dat_sub.shape

(814, 6)
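
One caveat worth checking: the sample rows printed earlier include `inf` severity values, and `dropna()` only removes `NaN`. The sketch below shows one possible way to treat infinite values as missing before dropping rows. The toy frame and its numbers are made up for illustration, and whether any `inf` values actually survive in `dat_sub` is something to verify on the real data rather than assume.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dat_sub; the values are made up for illustration.
toy = pd.DataFrame({
    "MW_Vltn_Svrty": [45.6, np.nan, 12.0, 3822.5],
    "OTHER_Vltn_Svrty": [71.0, 5.0, np.inf, 0.0],
    "PAYR_N": [24865.0, 100.0, 0.0, 45871.0],
})

# dropna() alone keeps the row that contains inf.
only_dropna = toy.dropna(axis=0, how="any")

# Converting +/-inf to NaN first removes that row as well.
toy_clean = toy.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how="any")

# Number of rows that still hold a non-finite value after cleaning.
n_bad = int((~np.isfinite(toy_clean.to_numpy())).any(axis=1).sum())

print(only_dropna.shape, toy_clean.shape, n_bad)
```

This matters for the next step because a single infinite value makes that feature's standard deviation undefined, which would spoil the normalization of the whole column.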

In [26]:
#normalize the feature space (each feature is normalized across all vectors)
#I use the convenient whiten() function in the scipy package, which divides each feature by its standard deviation
dat_sub_norm = scp_vq.whiten(dat_sub)
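
For readers unfamiliar with `whiten()`, the following small sketch shows what it does on a made-up matrix: each column is divided by its standard deviation, without subtracting the mean. The array `X` is purely illustrative.

```python
import numpy as np
import scipy.cluster.vq as scp_vq

# Made-up feature matrix: 3 vectors, 2 features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

X_white = scp_vq.whiten(X)

# Equivalent manual computation: divide each column by its standard deviation.
# NumPy's std() uses ddof=0 by default, which is what whiten() uses too.
X_manual = X / X.std(axis=0)

same = np.allclose(X_white, X_manual)
col_std = X_white.std(axis=0)
print(same, col_std)
```

After whitening, every feature has unit variance, so a feature measured in dollars (like payroll) no longer dominates the distance simply because its numbers are larger.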

Now let's generate a similarity matrix (pairwise distances) between each pair of vectors. This raises an important question: which distance metric should we use? Everyone and their mother has invented a [distance metric](http://docs.scipy.org/doc/scipy/reference/spatial.distance.html), not that I'm complaining. General literature suggests picking the appropriate metric for your problem space, which is easier said than done really. 

I decided to go with the *Minkowski distance* because it generalizes well in a multi-dimensional numeric space. It is a family of metrics controlled by an order parameter p: p=1 gives the *Manhattan distance* (the city block measure between two points), p=2 gives the *Euclidean distance* (the shortest straight line between two points), and larger values of p put increasingly more weight on the feature along which two vectors differ the most. I use p=6 below.

The best intuitive explanation I've found of this distance is at this [page](http://www.code10.info/index.php%3Foption%3Dcom_content%26view%3Darticle%26id%3D61:articleminkowski-distance%26catid%3D38:cat_coding_algorithms_data-similarity%26Itemid%3D57):  
> the Minkowski metric of the order ∞ returns the distance along that axis on which the two objects show the greatest absolute difference. 
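
As a quick worked example, the Minkowski distance of order p between two vectors u and v is (Σ |u_i − v_i|^p)^(1/p). The sketch below evaluates it for a made-up pair of 2-D points at a few orders, using SciPy's `minkowski` and `chebyshev` functions (the latter being the order-∞ limit described in the quote).

```python
import numpy as np
from scipy.spatial.distance import minkowski, chebyshev

# Two made-up points that differ by 3 along one axis and 4 along the other.
u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

# Minkowski distance at increasing orders p.
dists = {p: minkowski(u, v, p) for p in (1, 2, 6)}

# The limit p -> infinity is the Chebyshev distance (largest single-axis gap).
d_inf = chebyshev(u, v)

for p, d in dists.items():
    print(p, round(d, 4))
print("inf", d_inf)
```

The distance shrinks from 7 (Manhattan) to 5 (Euclidean) and approaches 4 (the largest single-axis difference) as p grows, so p=6 already behaves much like "distance along the worst axis".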


In [32]:
dist_mat = pairwise_distances(dat_sub_norm, metric='minkowski', p=6)

Well that was easy. The result is a square matrix with one row and one column per vector. The 0's along its diagonal are expected, as each vector is at distance zero from itself:

In [34]:
dist_mat.shape

(814, 814)
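
A few cheap sanity checks on a distance matrix are worth running before building anything on top of it. The sketch below uses a small random matrix as a stand-in for `dat_sub_norm` (the data is made up) and checks the properties one would expect: square shape, zero diagonal, symmetry and non-negativity.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Made-up stand-in for dat_sub_norm: 5 vectors with 6 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))

D = pairwise_distances(X, metric="minkowski", p=6)

is_square = D.shape == (X.shape[0], X.shape[0])
zero_diag = np.allclose(np.diag(D), 0.0)   # each vector is at distance 0 from itself
symmetric = np.allclose(D, D.T)            # d(a, b) == d(b, a)
non_negative = bool((D >= 0).all())

print(is_square, zero_diag, symmetric, non_negative)
```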

In [44]:
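#drop rows on the feature columns only, so the remaining rows line up one-to-one with the rows of dist_mat
dat_nonull = dat.dropna(axis=0, how='any', subset=vector_defn)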

In [45]:
dat_nonull.shape

(814, 34)

In [62]:
#function to take in one node (industry) and return similar nodes 

def single_node_json(mat, labels, sourceIndex, clusters, threshold=0.05):
    """Build a nodes/links graph dict linking sourceIndex to every node closer than threshold."""
    labels = list(labels)  #zip() objects are not subscriptable in Python 3

    #link the source node to every node (itself included) within the distance threshold
    links = []
    for target in range(mat.shape[0]):
        val = mat[sourceIndex, target]
        if val < threshold:
            links.append({'source': sourceIndex, 'target': target, 'value': val})

    #one node per vector; every node gets the same group label passed in as clusters
    nodes = [{'name': labels[i], 'group': clusters} for i in range(mat.shape[0])]

    return {'nodes': nodes, 'links': links}


In [63]:
single_node_json(dist_mat,
                zip(dat_nonull['NAICS3'],dat_nonull['NAICS3Desc']),
                 10, 0)

{'links': [{'source': 10, 'target': 4, 'value': 0.049998617650284086},
  {'source': 10, 'target': 10, 'value': 0.0},
  {'source': 10, 'target': 29, 'value': 0.022163439816944856},
  {'source': 10, 'target': 80, 'value': 0.03652719097501754},
  {'source': 10, 'target': 81, 'value': 0.0056278533556312462},
  {'source': 10, 'target': 97, 'value': 0.041547908250903688},
  {'source': 10, 'target': 139, 'value': 0.035574612650886996},
  {'source': 10, 'target': 156, 'value': 0.020525372795756017},
  {'source': 10, 'target': 162, 'value': 0.030370329860003493},
  {'source': 10, 'target': 163, 'value': 0.047549749045052915},
  {'source': 10, 'target': 213, 'value': 0.0059010704221652572},
  {'source': 10, 'target': 275, 'value': 0.041619461634854207},
  {'source': 10, 'target': 283, 'value': 0.01179732197331195},
  {'source': 10, 'target': 289, 'value': 0.042710664364180528},
  {'source': 10, 'target': 291, 'value': 0.0054641768897295516},
  {'source': 10, 'target': 300, 'value': 0.02591278081

In [54]:
import json

In [64]:
with open('single_node_similarity2.json','w') as f:
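    #.tolist() converts the NumPy integer codes to plain Python ints, which json can serialize
    json.dump(single_node_json(dist_mat,
                list(zip(dat_nonull['NAICS3'].tolist(), dat_nonull['NAICS3Desc'])),
                 10, 0), 
              f)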
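
The fixed 0.05 cutoff returns however many neighbours happen to fall under it. As an alternative view, one could rank neighbours and read off their labels directly. The helper below is a hypothetical sketch (the name `top_k_similar`, the toy matrix and the toy labels are not part of the original notebook) showing how a row of the distance matrix can be turned into a ranked list.

```python
import numpy as np

def top_k_similar(mat, labels, source_index, k=5):
    """Return the k nearest neighbours of source_index as (label, distance) pairs."""
    order = np.argsort(mat[source_index])          # closest first
    order = order[order != source_index][:k]       # skip the node itself
    return [(labels[j], float(mat[source_index, j])) for j in order]

# Toy symmetric distance matrix with made-up values and labels.
toy_mat = np.array([[0.0, 0.3, 0.1, 0.9],
                    [0.3, 0.0, 0.4, 0.2],
                    [0.1, 0.4, 0.0, 0.5],
                    [0.9, 0.2, 0.5, 0.0]])
toy_labels = ["A", "B", "C", "D"]

nearest = top_k_similar(toy_mat, toy_labels, 0, k=2)
print(nearest)
```

With the real data, `labels` could be a list of (MSA, NAICS3 description) pairs built from `dat_nonull`, so that each neighbour is reported by name rather than by row index.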