# Exploring Wage Theft Data (SF Jurisdiction Only)

In this exploration, I'm using the dataset put together by Michael Eastman, Asst. District Director of the DOL WHD, San Francisco. The dataset aggregates minimum wage violations, employees affected, and penalties assessed. Michael has also merged in industry-group figures (at the 3-digit NAICS level) such as the number of establishments, average weekly pay, and the size of the employee workforce.

In [1]:
import csv
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import requests
import json
import seaborn as sbrn
import urllib2
import re
import pickle
from sklearn.cluster import k_means
import scipy.spatial as sp
%matplotlib inline

Load the San Francisco District Office jurisdiction dataset.

In [2]:
sf_vltns = pd.read_excel('/Users/ash/Downloads/SanFranciscoDistrictDOLCasesAndEmploymentCensus.xlsx')


Here's a sample of the dataset.

In [6]:
sf_vltns.sample(5)

| | Industry | NAICS3 | NAICS Code | Number of Cases | EEs Employed in Violation | Minimum Wage Backwages | Backwages | Minimum Wage Backwages per Employee | Backwages per Employee | Penalty Assessed | Employees in Industry | Avg Weekly Pay | Establishments in Industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | Publishing Industries (except Internet) | 511 | 51 | 5 | 71 | 28654.88 | 215358.54 | 819 | 3723.0 | 0.0 | 75224.0 | 1525.0 | 1470.0 |
| 94 | State and Local Government | 911 | 91 | 1 | 1 | 0.0 | 5091.23 | 0 | 5091.0 | 0.0 | NaN | NaN | NaN |
| 54 | Air Transportation | 481 | 48 | 1 | 5 | 507.5 | 507.5 | 102 | 102.0 | 0.0 | 17270.0 | 374.0 | 119.0 |
| 58 | Hospitals | 622 | 62 | 10 | 253 | 157.2 | 149867.78 | 16 | 1737.9 | 0.0 | 106545.0 | 982.0 | 114.0 |
| 33 | Broadcasting (except Internet) | 515 | 51 | 2 | 27 | 8888.5 | 26147.45 | 635 | 1066.5 | 0.0 | 4609.0 | 679.0 | 200.0 |


# Severity of Violations  
In order to get a sense of the severity of wage violations - specifically those pertaining to minimum wage - and with the data at hand, we compare the Minimum Wage Backwages owed per e...

In [4]:
sf_vltns['MW_Severity'] = sf_vltns['Minimum Wage Backwages per Employee'] / sf_vltns['Establishments in Industry']


In [5]:
sf_vltns['Emp_AffectedRate'] = sf_vltns['EEs Employed in Violation'] / sf_vltns['Employees in Industry']


In [6]:
sf_vltns['Reported_Rate'] = sf_vltns['Number of Cases'] / np.sum(sf_vltns['Number of Cases'])
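
To make the three derived columns concrete, here is a minimal sketch that applies the same formulas to a tiny, made-up frame. The two rows and all of their numbers are hypothetical and are not taken from the DOL dataset; only the column names and the formulas mirror the cells above.

```python
import pandas as pd

# Hypothetical toy rows; the numbers are invented for illustration only.
toy = pd.DataFrame({
    'Industry': ['A', 'B'],
    'Number of Cases': [3, 1],
    'EEs Employed in Violation': [30, 5],
    'Minimum Wage Backwages per Employee': [200.0, 50.0],
    'Employees in Industry': [1000.0, 500.0],
    'Establishments in Industry': [100.0, 25.0],
})

# Same formulas as in the notebook cells above.
toy['MW_Severity'] = (toy['Minimum Wage Backwages per Employee']
                      / toy['Establishments in Industry'])
toy['Emp_AffectedRate'] = (toy['EEs Employed in Violation']
                           / toy['Employees in Industry'])
toy['Reported_Rate'] = toy['Number of Cases'] / toy['Number of Cases'].sum()

print(toy[['Industry', 'MW_Severity', 'Emp_AffectedRate', 'Reported_Rate']])
```

Note that `Reported_Rate` is a share of all reported cases, so it sums to 1 across industries, while the other two metrics are normalized by industry size.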


In [61]:
sf_vltns.isnull().sum()


Industry                                0
NAICS3                                  0
NAICS Code                              0
Number of Cases                         0
EEs Employed in Violation               0
Minimum Wage Backwages                  0
Backwages                               0
Minimum Wage Backwages per Employee     0
Backwages per Employee                  0
Penalty Assessed                        0
Employees in Industry                  20
Avg Weekly Pay                         20
Establishments in Industry             20
MW_Severity                            20
Emp_AffectedRate                       20
Reported_Rate                           0
dtype: int64
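
The null counts above show that the industry census columns are missing for some rows, and the derived metrics built from them inherit those gaps. One possible way to see which industries are affected, and to set them aside, is sketched below on a hypothetical three-row frame (the rows are invented; only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: industry 'B' has no census employment figure.
toy = pd.DataFrame({
    'Industry': ['A', 'B', 'C'],
    'EEs Employed in Violation': [30, 5, 1],
    'Employees in Industry': [1000.0, np.nan, 500.0],
})
toy['Emp_AffectedRate'] = (toy['EEs Employed in Violation']
                           / toy['Employees in Industry'])

# Rows where the census figure is missing (the derived rate is NaN too).
missing = toy[toy['Employees in Industry'].isnull()]

# Rows that can safely be used in later numeric comparisons.
complete = toy.dropna(subset=['Emp_AffectedRate'])

print(missing['Industry'].tolist())
print(len(complete))
```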

In [52]:
simMat = sf_vltns.iloc[:,-3:]


In [53]:
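# Convert the three derived metrics to a matrix for pairwise distances.
# Note: the intent was Manhattan distance (to find industries that might be
# similar on a subset of features), but the pdist call below uses 'cosine'.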
simMat=np.matrix(simMat)


In [44]:
#simMat /= simMat.std(axis=0)


In [54]:
distMat = sp.distance.pdist(simMat, 'cosine')
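
The choice of metric matters here. As a rough illustration on made-up vectors (not the notebook's data), cosine distance ignores magnitude and compares only direction, while Manhattan distance (`'cityblock'` in SciPy) sums the absolute per-feature differences:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical feature rows, for illustration only.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # same direction as row 0, twice the magnitude
              [3.0, 2.0, 1.0]])

cos = squareform(pdist(X, 'cosine'))      # 1 - cosine similarity
man = squareform(pdist(X, 'cityblock'))   # sum of absolute differences

print(cos[0, 1], man[0, 1])  # rows 0 and 1: near-zero cosine, large Manhattan
print(cos[0, 2], man[0, 2])
```

Rows 0 and 1 are nearly identical under cosine distance but far apart under Manhattan distance, so the two metrics can rank "similar" industries quite differently.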


In [55]:
sp.distance.squareform(distMat)


array([[ 0.        ,  0.54410454,  0.01881966, ...,         nan,
                nan,         nan],
       [ 0.54410454,  0.        ,  0.72446623, ...,         nan,
                nan,         nan],
       [ 0.01881966,  0.72446623,  0.        , ...,         nan,
                nan,         nan],
       ..., 
       [        nan,         nan,         nan, ...,  0.        ,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
         0.        ,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,  0.        ]])
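
The `nan` entries appear because any row containing a missing value yields an undefined distance to every other row. A minimal sketch of one way to avoid this, assuming it is acceptable to leave incomplete rows out of the comparison, is to mask them before calling `pdist`. The array `X` below is hypothetical stand-in data, not the notebook's `simMat`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for the feature matrix; row 1 has missing values.
X = np.array([[1.0, 0.0, 0.5],
              [np.nan, np.nan, 0.2],
              [0.0, 1.0, 0.5]])

# Keep only rows with no NaN in any column.
keep = ~np.isnan(X).any(axis=1)

# Pairwise cosine distances among the complete rows only.
D = squareform(pdist(X[keep], 'cosine'))

print(D.shape)
print(np.isnan(D).any())
```

The boolean mask `keep` can also be used to index the original frame, so that the rows of `D` can be mapped back to industry names.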

In [57]:
sp.distance.squareform(distMat)[0]

array([  0.00000000e+00,   5.44104537e-01,   1.88196571e-02,
         2.72218270e-02,   2.33586143e-02,   5.99061028e-01,
         6.99719199e-03,   4.63324904e-01,   2.81658754e-02,
         2.37491009e-02,   8.51906592e-04,   2.78574627e-02,
         3.90608759e-01,   2.68428561e-02,   1.59037980e-01,
         7.03192528e-01,   1.71130814e-01,   2.37790529e-02,
                    nan,   2.65344392e-02,   2.77085369e-02,
         2.26558729e-02,   2.23659733e-02,   1.51371125e-02,
         5.03080550e-03,   2.69423464e-02,   1.50567834e-02,
         6.67022108e-01,   7.82595115e-03,   7.99811623e-02,
         2.75205318e-02,   4.90065853e-04,   2.62407720e-02,
         2.78597812e-02,   2.80025518e-02,   1.28299950e-02,
         2.80085078e-02,              nan,              nan,
         2.72444991e-02,   2.61669548e-02,   9.61667578e-03,
                    nan,              nan,              nan,
         1.53931213e-02,   2.24596791e-01,   2.49392284e-02,
                    nan,