# Texas wage theft data details

We shared the results of our analysis with the Texas Workforce Commission so they could comment. After receiving no response for about two weeks, Brian New, our local reporter in Texas interviewed the director of the Commission. In that interview, the director told Brian that they gave us the wrong fields for several of the fields we requested. This notebook will document how we're handling this in our analysis. There are also additional notes in notebooks/national_analysis/national_analysis.ipynb. 

In [1]:
import pandas as pd


In [2]:
df = pd.read_excel("input/Copy_of_RESULTS_-_MEDIA_REQUEST_-_January_1_2010_to_Jul_14_2022.xlsx")


They informed us there are duplicates in the data. These have also been addressed using a new processor in task 1

In [3]:
df = df.drop_duplicates("WAGE_CLAIM_ID")


We were told that their data does not reliably track the amounts of money that claimants received, despite the fact that a field exists in their database and appears to be used some of the time. They informed us that these numeric status codes are used to designate cases that are paid. 

In [4]:
PAID_STATUS_CODES = [950, 970]


These same figures are also now reflected in notebooks/national_analysis/national_analysis.ipynb. 

In [5]:
(
    df.assign(paid=lambda x: x.FK_VCMPLNT_STSCD.isin(PAID_STATUS_CODES))
    .query("AWARDED == 'YES'")
    .paid.value_counts()
    .to_frame("n")
    .transpose()
    .assign(pct_true=lambda x: x[True] / x.sum(axis=1))
)


Unnamed: 0,True,False,pct_true
n,33540,26755,0.556265


For comparison, this is the figure you get when you use the paid amounts. 

In [6]:
(
    pd.read_excel("input/ORR_R005317-081222_from_CBS__C._Hacker__File_date___Amts.xlsx")
    .assign(paid=lambda x: x.PAID.notna() & x.PAID != 0)
    .query("CLAIMED != 0")
    .paid.value_counts()
    .to_frame("n")
    .transpose()
    .assign(pct_true=lambda x: x[True] / x.sum(axis=1))
)


Unnamed: 0,False,True,pct_true
n,106124,13341,0.111673


The TWC also informed us that they provided the wrong date field for the case end date. They gave us the status date, which designates the most recent change to the case, such as filing an appeal or entering a lien. They told us that they would rather us use the date a preliminary decision was reached, which gives you a 65 day case duration rather than the > 200 day duration we had originally. However, we believe that the status date is fair to use. The preliminary decision does not mean a person was paid, and if a case is appealed or penalties are applied to a business, that still means that the case is functionally pending for the claimant. 

They did not provide the decision date field. 