## **Auditing Algorithmic Detection of Welfare Fraud**
#### **Final Project Phase 2**
*Tanvi Namjoshi, Dylan Van Bramer, Madeline Demers, Ella White*

In this report, we conduct a preliminary analysis of the synthetic data sourced by Lighthouse Reports, as well as the welfare fraud algorithm provided with it. The original source for the data can be found here: [https://github.com/Lighthouse-Reports/suspicion_machine/tree/main](https://github.com/Lighthouse-Reports/suspicion_machine/tree/main)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

**Part One: Data Imports and Data Cleaning**

The data we sourced was used in the Netherlands, so all of the column labels were in Dutch. This data can be found in the file `synth_data.csv`. To help us better understand the data, and as a preliminary cleaning step, we first translated all the column labels into English using Google Translate. This was done manually, without code. The translated data can be found in the file `translated_synth_data.csv`. One further note on cleaning: the data journalists who originally sourced this data and the model code did some cleaning of their own. For more information on how they cleaned the data, please refer to [https://pulitzercenter.org/stories/suspicion-machines-methodology](https://pulitzercenter.org/stories/suspicion-machines-methodology#:~:text=Appendix%20II.%20Data%20Preparation). In the two code cells below, we display the first 5 rows of both the Dutch and English data files. Because there are many features, the output does not show every column, but instead gives a sample.
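The translation itself was done by hand, but the same renaming could be recorded in code so that it is reproducible. The following is a minimal sketch under stated assumptions, not the procedure we used: `dutch_to_english` is a hypothetical two-entry mapping (the real file has far more columns), `translate_columns` is an illustrative helper, and the toy frame stands in for `synth_data.csv`.

```python
import pandas as pd

# Hypothetical, partial mapping from Dutch labels to English labels.
# Only two of the columns are shown here for illustration.
dutch_to_english = {
    "adres_aantal_brp_adres": "address_number_brp_address",
    "adres_dagen_op_adres": "address_days_at_address",
}


def translate_columns(df, mapping):
    """Return a copy of df with columns renamed according to mapping.

    Columns that have no entry in the mapping are reported and left unchanged.
    """
    missing = [col for col in df.columns if col not in mapping]
    if missing:
        print("Untranslated columns:", missing)
    return df.rename(columns=mapping)


# Toy frame standing in for the Dutch data file (assumption for illustration)
toy = pd.DataFrame(
    {"adres_aantal_brp_adres": [6, 4], "adres_dagen_op_adres": [1012, 5268]}
)
translated_toy = translate_columns(toy, dutch_to_english)
print(list(translated_toy.columns))
```

Keeping the mapping in a dictionary (or a small CSV) would also make it easy to review individual translations later.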

In [3]:
raw_data = pd.read_csv('synth_data.csv')
raw_data.head()

Unnamed: 0,adres_aantal_brp_adres,adres_aantal_verschillende_wijken,adres_aantal_verzendadres,adres_aantal_woonadres_handmatig,adres_dagen_op_adres,adres_recentst_onderdeel_rdam,adres_recentste_buurt_groot_ijsselmonde,adres_recentste_buurt_nieuwe_westen,adres_recentste_buurt_other,adres_recentste_buurt_oude_noorden,...,typering_dagen_som,typering_hist_aantal,typering_hist_inburgeringsbehoeftig,typering_hist_ind,typering_hist_sector_zorg,typering_ind,typering_indicatie_geheime_gegevens,typering_other,typering_transport__logistiek___tuinbouw,typering_zorg__schoonmaak___welzijn
0,6,3,1,0,1012,1,0,0,1,0,...,917,2,0,1,0,0,0,0,0,0
1,4,2,1,0,5268,1,0,0,0,0,...,1603,2,0,1,0,1,0,1,0,0
2,4,2,0,1,1820,1,0,0,1,0,...,-4769,1,0,1,0,0,0,0,0,0
3,3,2,0,1,9056,1,0,0,0,0,...,4189,2,0,1,0,1,0,0,0,0
4,3,3,0,2,5246,1,0,0,1,0,...,502,1,0,1,0,1,0,0,0,0


In [4]:
# Display translated version
translated_data = pd.read_csv('translated_synth_data.csv')
translated_data.head()

Unnamed: 0,address_number_brp_address,address_number_different_districts,address_number_shipping_address,address_number_residential_address_manual,address_days_at_address,address_recent_part_rdam,address_recent_neighborhood_groot_ijsselmonde,address_recent_neighborhood_new_west,address_recent_neighborhood_other,address_recent_neighborhood_old_north,...,typing_days_sum,typing_hist_number,typing_hist_need for integration,typing_hist_ind,typing_hist _sector_care,typing_ind,typing_indication_secret_data,typing_other,typing_transport__logistics___horticulture,typing_care__cleaning___well-being
0,6,3,1,0,1012,1,0,0,1,0,...,917,2,0,1,0,0,0,0,0,0
1,4,2,1,0,5268,1,0,0,0,0,...,1603,2,0,1,0,1,0,1,0,0
2,4,2,0,1,1820,1,0,0,1,0,...,-4769,1,0,1,0,0,0,0,0,0
3,3,2,0,1,9056,1,0,0,0,0,...,4189,2,0,1,0,1,0,0,0,0
4,3,3,0,2,5246,1,0,0,1,0,...,502,1,0,1,0,1,0,0,0,0
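
Some translated labels in the header above contain spaces (for example `typing_hist_need for integration` and `typing_hist _sector_care`), which can make column access error-prone. One possible cleanup is sketched below. It is an optional step we have not applied; `normalize_column_names` is a hypothetical helper, and the toy frame only borrows a few of the labels shown above.

```python
import re

import pandas as pd


def normalize_column_names(df):
    """Return a copy of df whose column names contain no whitespace.

    Leading and trailing whitespace is stripped, and each internal run of
    whitespace is replaced by a single underscore. Existing underscores are
    left untouched.
    """
    new_names = [re.sub(r"\s+", "_", str(col).strip()) for col in df.columns]
    # Guard against two different labels collapsing into the same name
    if len(set(new_names)) != len(new_names):
        raise ValueError("Normalization produced duplicate column names")
    out = df.copy()
    out.columns = new_names
    return out


# Toy frame using a few labels from the translated header (illustration only)
toy = pd.DataFrame(
    columns=["typing_hist_need for integration", "typing_hist _sector_care", "typing_ind"]
)
clean = normalize_column_names(toy)
print(list(clean.columns))
```

Because the helper only touches whitespace, labels that already use repeated underscores, such as `typing_transport__logistics___horticulture`, keep their original form.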


**Part Two: Summary Statistics**


In [27]:
# Part 2 (a)
# Read the row and column counts directly from the dataframe's shape
rows, columns = translated_data.shape
print("There are", rows, "rows of data in the dataframe")
print("There are", columns, "features/columns in the dataframe")

# Per sensitive attribute subgroup

# Gender
female_idx = translated_data.index[translated_data["person_gender_female"]==1]
non_female_idx = translated_data.index[translated_data["person_gender_female"]!=1]
print("The number of data points where the person's gender is female is: ", len(female_idx))
print("The number of data points where the person's gender is NOT female is: ", len(non_female_idx))

# List of all features
print("The following is a list of feature names: ")
list(translated_data.columns)

There are 12645 rows of data in the dataframe
There are 315 features/columns in the dataframe
The number of data points where the person's gender is female is:  6103
The number of data points where the person's gender is NOT female is:  6542
The following is a list of feature names: 


['address_number_brp_address',
 'address_number_different_districts',
 'address_number_shipping_address',
 'address_number_residential_address_manual',
 'address_days_at_address',
 'address_recent_part_rdam',
 'address_recent_neighborhood_groot_ijsselmonde',
 'address_recent_neighborhood_new_west',
 'address_recent_neighborhood_other',
 'address_recent_neighborhood_old_north',
 'address_recent ste_neighborhood_vreewijk',
 'address_recent_place_other',
 'address_recent_place_rotterdam',
 'address_recent_wijk_charlois',
 'address_recent_wijk_delfshaven',
 'address_recent_wijk_feijenoord',
 'address_recent_wijk_ijsselmonde',
 'address_recent_wijk_kralingen_c',
 'address_recent_wijk_noord',
 'address_recent_wijk_other',
 'address_recent_wijk _prins_alexa',
 'address_recent_district_city_center',
 'address_unique_district_ratio',
 'appointment_registration_completed',
 'appointment_number_words',
 ' appointment_last_year_appointment_plan',
 'appointment_last_year_monitoring_insp__law_langua
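
The subgroup counts printed in this part could also be produced with `value_counts`, which extends to other sensitive attributes without repeating the indexing code. The sketch below is illustrative only: `subgroup_summary` is a hypothetical helper, and the five-row toy frame stands in for `translated_data`.

```python
import pandas as pd


def subgroup_summary(df, column):
    """Return the count and share of rows for each value of column."""
    counts = df[column].value_counts(dropna=False)
    shares = df[column].value_counts(normalize=True, dropna=False)
    # Both series are indexed by the attribute value, so they align by label
    return pd.DataFrame({"count": counts, "share": shares})


# Toy frame standing in for translated_data (assumption for illustration)
toy = pd.DataFrame({"person_gender_female": [1, 0, 0, 1, 0]})
summary = subgroup_summary(toy, "person_gender_female")
print(summary)
```

Reporting shares alongside raw counts makes it easier to compare subgroup sizes across attributes with different numbers of categories.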

**Part Three: Research Question, Hypotheses, and Analysis Plan**


**Part Four: Modelling**

**Part Five: Results**

**Part Six: Contribution Notes**
* Tanvi: 
* Ella: 
* Dylan: 
* Maddy: 

**Part Seven: Sources Cited**
