# Bias in Known Dataset

We haven't yet merged any external data sources, so we'll try to get a sense of bias by comparing the *severity* of violations to the number of cases investigated. With the features available in this dataset, we can measure the *severity* of a violation as:

> Violation Severity = Backwages Owed / Num. Employees Affected  

In [2]:
import pandas as pd
import numpy as np
#import zipfile
#import requests
#import StringIO
import matplotlib.pyplot as plt
import seaborn as sbrn
import whd_utilities as whduts #my own python script for accompanying utility functions

In [9]:
whd_grps = pd.read_csv('./../data/whd_groupedViolations.csv', low_memory=False)

In [10]:
whd_grps.drop('Unnamed: 0', axis=1, inplace=True)  # leftover index column written out by an earlier to_csv call

In [18]:
pd.set_option('display.max_columns', 109)  # full option name; the bare 'max_columns' is ambiguous in recent pandas
whd_grps.head(3)

Unnamed: 0,trade_nm,cty_nm,zip_cd,st_cd,naic_cd,naics_code_description,case_violtn_cnt,cmp_assd_cnt,ee_violtd_cnt,ee_atp_cnt,MinWage_ATPAmt,BelowMinWage_Cases,BelowMinWage_ATPAmt,MinWage_Cases,BelowMinWage_EmpAff,MinWage_EmpAff,All_AtpAmt,Other_Cases,Other_ATPAmt,Other_EmpAff,is_violator,MW_vltn_severity,BMW_vltn_severity,All_vltn_severity,Other_vltn_severity
0,Anid Care Home,Ionia,48846.0,MI,623990,Other Residential Care Facilities,3,0.0,1,0,0.0,0,0.0,3,0,0,0.0,0,0.0,1,1,,,,0.0
1,Eye Land Vision,Houston,77082.0,TX,446130,Optical Goods Stores,11,0.0,10,10,2407.62,0,0.0,11,0,10,7222.86,0,4815.24,0,1,240.762,,722.286,inf
2,Bella Vita School (The),Longmont,80501.0,CO,624410,Child Day Care Services,2,0.0,0,0,0.0,0,0.0,2,0,0,0.0,0,0.0,0,1,,,,


In [16]:
# Severity of various types of violations
whd_grps['MW_vltn_severity'] = whd_grps['MinWage_ATPAmt'] / whd_grps['MinWage_EmpAff']
whd_grps['BMW_vltn_severity'] = whd_grps['BelowMinWage_ATPAmt'] / whd_grps['BelowMinWage_EmpAff']
whd_grps['All_vltn_severity'] = whd_grps['All_AtpAmt'] / whd_grps['ee_atp_cnt']
whd_grps['Other_vltn_severity'] = whd_grps['Other_ATPAmt'] / whd_grps['Other_EmpAff']
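
Rows with zero affected employees make the ratios above come out as `inf` (a nonzero amount over zero) or `NaN` (zero over zero), as the `head(3)` output shows. Below is a minimal sketch of one way to handle this, assuming an undefined severity is better recorded as missing. The helper name `safe_severity` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def safe_severity(amount, employees):
    """Back wages per affected employee; NaN where no employees are recorded."""
    # Swapping 0 for NaN in the denominator turns both x/0 (inf) and 0/0 into NaN.
    return amount / employees.replace(0, np.nan)

# Toy rows that reuse the notebook's column names (values are made up).
toy = pd.DataFrame({
    'Other_ATPAmt': [0.0, 4815.24, 300.0],
    'Other_EmpAff': [1, 0, 3],
})
toy['Other_vltn_severity'] = safe_severity(toy['Other_ATPAmt'], toy['Other_EmpAff'])
print(toy)
```

Whether missing is the right encoding depends on the later analysis; the point is only that `inf` values would otherwise distort any mean or plot of severity.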

Let's see how this looks when compared to the number of investigations that the WHD conducted. The terminology is a bit confusing here, because we don't want to use the *case_violtn_cnt* metric, which often corresponds one-to-one with the number of employees affected. Instead, we'll treat each row with its own case id (removed from this dataset earlier) as an *investigated incident*.

To do this, we'll count the number of *investigations* for each company (by trade name) within a city and zip code. Later, we will sum up our numbers to the MSA or State level.

In [32]:
# One row per (city, zip, trade name); the group size is the number of investigated incidents
trade_vltn_cnts = whd_grps.groupby(['cty_nm', 'zip_cd', 'trade_nm']).size().to_frame('num_investigations')

In [35]:
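# Inner join against the grouped index: rows whose keys are absent from trade_vltn_cnts
# (e.g. a missing zip_cd, which groupby drops by default) are discarded
whd_grps = whd_grps.merge(trade_vltn_cnts, how='inner', left_on=['cty_nm', 'zip_cd', 'trade_nm'], right_index=True)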
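
The count-and-merge above can also be written in one step. Here is a sketch on made-up rows, assuming each row is one investigated incident: `transform('size')` broadcasts each group's row count back onto its rows, so no separate merge is needed.

```python
import pandas as pd

# Made-up rows that reuse the notebook's key columns.
toy = pd.DataFrame({
    'trade_nm': ['Acme', 'Acme', 'Bolt', 'Acme'],
    'cty_nm':   ['Houston', 'Houston', 'Ionia', 'Austin'],
    'zip_cd':   [77082.0, 77082.0, 48846.0, 73301.0],
})

keys = ['cty_nm', 'zip_cd', 'trade_nm']
# Each row receives the size of the (city, zip, trade name) group it belongs to.
toy['num_investigations'] = toy.groupby(keys)['trade_nm'].transform('size')
print(toy)
```

How rows with a missing `zip_cd` are handled may differ between the two approaches, so that is worth checking on the real data before swapping one for the other.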

# Statistics of US Businesses

What we need is to understand back wages owed relative to the proportion of total employees affected. While that number is nearly impossible to get for an individual company, the Bureau of Labor Statistics does provide an industry-code-driven breakdown of total employees. We'll therefore have to recast our analysis at the industry level.
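
As a rough sketch of what an industry-level view could look like, the toy example below sums back wages and affected employees per `naic_cd` and divides by an industry employment total. The `industry_emp` table and its `total_employees` column are hypothetical stand-ins, not actual BLS field names, and all values are made up.

```python
import pandas as pd

# Made-up violation rows that reuse the notebook's column names.
violations = pd.DataFrame({
    'naic_cd':    [623990, 446130, 446130],
    'All_AtpAmt': [0.0, 7222.86, 1000.0],
    'ee_atp_cnt': [0, 10, 5],
})

# Hypothetical industry employment totals (assumed layout, not a real BLS schema).
industry_emp = pd.DataFrame({
    'naic_cd':         [623990, 446130],
    'total_employees': [2000, 500],
})

# Roll violations up to the industry code, then attach the employment totals.
by_industry = violations.groupby('naic_cd', as_index=False)[['All_AtpAmt', 'ee_atp_cnt']].sum()
by_industry = by_industry.merge(industry_emp, on='naic_cd', how='left')

# Share of the industry's employees who were owed back wages.
by_industry['share_affected'] = by_industry['ee_atp_cnt'] / by_industry['total_employees']
print(by_industry)
```

A left join keeps industries that have violations but no matching employment row, which makes gaps in the external data visible as `NaN` instead of silently dropping them.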

The BLS 

In [36]:
#Tableau...
#top 50 
whd_grps.to_csv('./../data/whd_groupedViolations.csv')