**Project 3: Urban Ministries of Durham**

**Author: Jesus Vazquez**

**Background Description**

Urban Ministries of Durham assists around 6,000 people each year who need food, shelter, clothing, and/or supportive services. Its programs include (1) a community shelter, where people are given a place to sleep and help finding a home; (2) a Community Café that serves three meals a day, seven days a week, 365 days a year; and (3) a food and clothing closet for those who need either food or clothing.

**Motivation**

As with any other organization, the organization's efforts need to be validated to understand whether Urban Ministries of Durham is doing a good job of helping people transform their lives. Having this information at hand when meeting with donors can reassure them that their donations are not going to waste. The counselors' efforts to help homeless people find a home are important, but one should also consider how finances are managed at the homeless shelter. For this reason, the tax records will be examined for signs of fraudulent behavior. To inspect the work of counselors, this project will examine whether the shelter reduces the number of disabilities reported by newcomers by the time they find a home.

**Approach**

To assess the financial reporting of Urban Ministries of Durham, the first digit of the net income, the expenses, and income minus expenses will be considered. Theoretically, these first digits should follow a Benford distribution, and any deviation from this distribution would suggest fabricated numbers. For the second part of the project, we will count the number of patients who came in with a condition (e.g., a mental health problem) and left with (1) none, (2) the same, or (3) more problems. The change in health-related conditions will be illustrated using bar plots, stratified by ethnicity.
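As a rough illustration of the Benford distribution mentioned above: under Benford's law, the expected share of values whose first digit is d equals log10(1 + 1/d). The sketch below is not part of the original analysis, and the function name `benford_percentages` is made up for this example.

```python
import math

def benford_percentages():
    """Return the expected percentage of leading digits 1-9 under Benford's law."""
    # P(first digit = d) = log10(1 + 1/d)
    return [100 * math.log10(1 + 1 / d) for d in range(1, 10)]

percentages = benford_percentages()
for digit, pct in zip(range(1, 10), percentages):
    print(digit, round(pct, 1))
```

Rounded to one decimal place, these values agree with the `BENFORD` list used later in this notebook.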





In [78]:
# Importing packages 
import numpy as np
from scipy import stats
import pandas as pd
import sys
import math

**Part 1**

To assess the financial reporting of Urban Ministries of Durham, the first digit of the net income, the expenses, and income minus expenses will be considered. Theoretically, these first digits should follow a Benford distribution, and any deviation from this distribution would suggest fabricated numbers. This part of the project conducts the statistical analysis; the graph is created in RStudio.

In [71]:
# Import tax data
df = pd.read_csv(r'C:\Users\15056\Documents\BIOS611\bios611-projects-fall-2019-jvazquez2\project_3\data\IRS.csv')
df = df[df.Net >= 0]
data = df['Net']

The code below counts the first digit of each number. This part of the project was facilitated by Elena C. You can find the original code on her GitHub page: https://github.com/eleprocha/Benford-s-Law_python_code/blob/master/code Additional modifications of the code can be found below.

In [102]:
# Counting the first digit
def count_first_digit(data_str):
    """Count the leading digits (1-9) of the values in column `data_str` of df."""
    # Keep values of at least 1 so that the leading digit is well defined
    values = df.loc[df[data_str] >= 1, data_str]
    first_digits = []
    for value in values:
        # Divide by 10 until a single digit is left before the decimal point
        while value >= 10:
            value = value / 10
        first_digits.append(int(value))
    # Count every digit from 1 to 9 so that digits that never occur get a zero
    data_count = [first_digits.count(d) for d in range(1, 10)]
    total_count = sum(data_count)
    data_percentage = [(c / total_count) * 100 for c in data_count]
    return total_count, data_count, data_percentage

# Digits 1-9; data_count and data_percentage are aligned with this list
numbers = list(range(1, 10))
total_count, data_count, data_percentage = count_first_digit("Net")
            
            
# Benford's Law percentages for leading digits 1-9
BENFORD = [30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6]

[4, 0, 1, 1, 3, 2, 0, 1, 0]

**Large Sample Theory**

In [73]:
# Get the expected number of counts
expected_counts=[round(p * total_count / 100) for p in BENFORD]

# Observed vs Expected  
chi_square_stat = 0  # chi square test statistic
for data, expected in zip(data_count,expected_counts):
    chi_square = math.pow(data - expected, 2)
    chi_square_stat += chi_square / expected

print("\nChi-squared Test Statistic = {:.3f}".format(chi_square_stat))
print("Critical value at a P-value of 0.05 is 15.51.")


Chi-squared Test Statistic = 9.500
Critical value at a P-value of 0.05 is 15.51.


Since our chi-square test statistic of 9.50 is less than the critical value of 15.51, we fail to reject the null hypothesis. We conclude that the income tax data from Urban Ministries of Durham do not provide sufficient evidence to claim that the nonprofit engages in fraudulent behavior when reporting taxes. One limitation of this analysis is the small number of observations: one of the recommendations for the test is to have at least 50 observations, but in this case we had only 12.
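Because the expected counts are rounded to whole numbers, they add up to 14 while only 12 values were observed. One possible refinement, sketched here and not part of the original analysis, is to keep the expected counts unrounded so that both totals agree. The observed counts are copied from the notebook output, `benford_chi_square` is a hypothetical helper name, and the resulting statistic will differ somewhat from the 9.50 reported above.

```python
import math

# Observed first-digit counts for digits 1-9, copied from the notebook output
observed = [4, 0, 1, 1, 3, 2, 0, 1, 0]

def benford_chi_square(observed_counts):
    """Chi-square statistic against Benford's law using unrounded expected counts."""
    n = sum(observed_counts)
    stat = 0.0
    for digit, obs in zip(range(1, 10), observed_counts):
        expected = n * math.log10(1 + 1 / digit)  # deliberately not rounded
        stat += (obs - expected) ** 2 / expected
    return stat

chi_square_unrounded = benford_chi_square(observed)
print("Chi-squared statistic (unrounded expected counts) = {:.3f}".format(chi_square_unrounded))
```

The statistic would still be compared against the same critical value, since the number of digit categories is unchanged.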

**Small Sample Theory** 

Since n is small, we will use the Wilcoxon Rank Sum Test to check whether the two distributions are identical.

In [85]:
print("actual counts are", data_count)
print("expected counts are", expected_counts)

# Wilcoxon Rank-Sum Test (also known as the Mann-Whitney U test)
stats.mannwhitneyu(data_count,expected_counts)

actual counts are [4, 0, 1, 1, 3, 2, 0, 1, 0]
expected counts are [4, 2, 2, 1, 1, 1, 1, 1, 1]


MannwhitneyuResult(statistic=32.5, pvalue=0.2384481543359947)

Just as with the Chi-Square test, the Wilcoxon Rank Sum Test shows that the data do not provide sufficient evidence to reject the hypothesis that the expected and observed values follow the same distribution. We conclude that the income tax reports from Urban Ministries of Durham do not provide sufficient evidence to suggest that the nonprofit engaged in disingenuous tax filing behavior.
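To make the rank-sum comparison more concrete, the sketch below computes the Mann-Whitney U statistic directly from its pairwise definition, using the two count lists printed above. It is an illustration only (the helper name `mann_whitney_u` is made up here) and does not compute a p-value; the notebook itself relies on SciPy for that.

```python
def mann_whitney_u(x, y):
    """U statistic for x: each pair with x_i > y_j counts 1, each tie counts 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

observed = [4, 0, 1, 1, 3, 2, 0, 1, 0]  # actual counts from the output above
expected = [4, 2, 2, 1, 1, 1, 1, 1, 1]  # expected counts from the output above

u_observed = mann_whitney_u(observed, expected)
u_expected = len(observed) * len(expected) - u_observed
print(min(u_observed, u_expected))  # 32.5, the statistic reported above
```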

In [112]:
# Save data and export to RStudio to create graphs
out_tax = pd.DataFrame(columns=['Number', 'Observed Count', 'Expected Count', 'Observed Percentage', 'Expected Percentage'])
out_tax['Number'] = numbers
out_tax['Observed Count'] = data_count
out_tax['Expected Count'] = expected_counts
out_tax['Observed Percentage'] = data_percentage
out_tax['Expected Percentage'] = BENFORD

# Exporting data to folder data under projects
out_tax.to_csv('../data/tax_data_graph.csv')

**Part 2**

Count the number of patients who came in with a condition (e.g., a mental health problem) and left with (1) none, (2) the same, or (3) more problems. The change in health-related conditions will be illustrated using bar plots, stratified by ethnicity, in RStudio.
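The three outcomes described above can be sketched with a small helper. The records below are invented for illustration and `classify_change` is a hypothetical name; the notebook itself works on the entry and exit disability tables. The sketch compares only the number of conditions, and it adds a "fewer" label for cases that fall outside the three outcomes listed.

```python
from collections import Counter

def classify_change(conditions_at_entry, conditions_at_exit):
    """Label a client by how the number of reported conditions changed."""
    n_entry = len(conditions_at_entry)
    n_exit = len(conditions_at_exit)
    if n_exit == 0:
        return "none"
    if n_exit > n_entry:
        return "more"
    if n_exit == n_entry:
        return "same"
    return "fewer"  # not one of the three outcomes named in the text

# Hypothetical clients: (conditions at entry, conditions at exit)
records = {
    "client_a": ({"Mental Health Problem"}, set()),
    "client_b": ({"Physical"}, {"Physical"}),
    "client_c": ({"Physical"}, {"Physical", "Chronic Health Condition"}),
}

outcome_counts = Counter(
    classify_change(entry, exit_) for entry, exit_ in records.values()
)
print(dict(outcome_counts))
```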

In [207]:
# Import Client Data
my_client_original = pd.read_csv("https://raw.githubusercontent.com/biodatascience/datasci611/gh-pages/data/project2_2019/CLIENT_191102.tsv", sep = '\t')
my_disability_entry_original = pd.read_csv("https://raw.githubusercontent.com/biodatascience/datasci611/gh-pages/data/project2_2019/DISABILITY_ENTRY_191102.tsv", sep = '\t')
my_disability_exit_original = pd.read_csv("https://raw.githubusercontent.com/biodatascience/datasci611/gh-pages/data/project2_2019/DISABILITY_EXIT_191102.tsv", sep = '\t')

In [225]:
# Keep only variables of interest and merge all datasets
my_client = my_client_original[['Client ID', 'Client Age at Entry', 'Client Age at Exit', 'Client Gender', 'Client Primary Race', 'Client Ethnicity']]
# Rename the type columns to a common name so entry and exit records can be matched
my_disability_entry = my_disability_entry_original.rename(columns={'Disability Type (Entry)': 'Disability Type'})
my_disability_entry = my_disability_entry[['Client ID','Disability Determination (Entry)', 'Disability Type']]
my_disability_exit = my_disability_exit_original.rename(columns={'Disability Type (Exit)': 'Disability Type'})
my_disability_exit = my_disability_exit[['Client ID','Disability Determination (Exit)', 'Disability Type']]

# Merge data by client ID
my_disability = pd.merge(my_disability_entry, my_disability_exit, on=['Disability Type','Client ID'], how = 'right')
my_disability.drop_duplicates(keep=False,inplace=True)

# Flag those patients who got a disability while at the UMD
gained = (my_disability['Disability Determination (Entry)'] == 'No (HUD)') & (my_disability['Disability Determination (Exit)'] == 'Yes (HUD)')
# Direct column assignment avoids the chained indexing that triggers SettingWithCopyWarning
my_disability['GotDisability'] = gained.astype(int)
my_disability['GotDisability'].mean()

#Making a subset of those who got a disability
my_disability = my_disability[my_disability.GotDisability == 1]


In [231]:
#Getting the socio-demographic characteristics of patients
graphs_data = pd.merge(my_disability, my_client, on = 'Client ID', how = 'left') 
graphs_data.size
graphs_data.head(100)

| | Client ID | Disability Determination (Entry) | Disability Type | Disability Determination (Exit) | GotDisability | Client Age at Entry | Client Age at Exit | Client Gender | Client Primary Race | Client Ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 146403 | No (HUD) | Mental Health Problem (HUD) | Yes (HUD) | 1 | 51.0 | 51.0 | Female | Black or African American (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 1 | 206895 | No (HUD) | Developmental (HUD) | Yes (HUD) | 1 | 27.0 | 27.0 | Male | White (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 2 | 441891 | No (HUD) | Both Alcohol and Drug Abuse (HUD) | Yes (HUD) | 1 | 57.0 | 57.0 | Male | Black or African American (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 3 | 421698 | No (HUD) | Mental Health Problem (HUD) | Yes (HUD) | 1 | 42.0 | 43.0 | Female | White (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 4 | 420015 | No (HUD) | Drug Abuse (HUD) | Yes (HUD) | 1 | 55.0 | 55.0 | Male | Black or African American (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 5 | 157439 | No (HUD) | Physical (HUD) | Yes (HUD) | 1 | 61.0 | 62.0 | Female | White (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 6 | 318721 | No (HUD) | Chronic Health Condition (HUD) | Yes (HUD) | 1 | 57.0 | 58.0 | Male | American Indian or Alaska Native (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 7 | 177232 | No (HUD) | Mental Health Problem (HUD) | Yes (HUD) | 1 | 36.0 | 36.0 | Female | Black or African American (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 8 | 471672 | No (HUD) | Mental Health Problem (HUD) | Yes (HUD) | 1 | 23.0 | 23.0 | Male | Black or African American (HUD) | Non-Hispanic/Non-Latino (HUD) |
| 9 | 472732 | No (HUD) | Both Alcohol and Drug Abuse (HUD) | Yes (HUD) | 1 | 54.0 | 55.0 | Male | Black or African American (HUD) | Non-Hispanic/Non-Latino (HUD) |
