## <font color='Black'>-></font> Are all races equally cited according to local demographics?

First, we need to get the appropriate data and create some common data structures to store the data.

In [36]:
#List of races as listed in the population data set
races_full = ['White','Hispanic','Asian','Black','Other']

#Two dataframes to be used to store the population percentages and citation counts of all areas
#(rows are the races; one column per police area is added later)
citations = pd.DataFrame(index=races_full, dtype=float)
populations = pd.DataFrame(index=races_full, dtype=float)

This function extracts the population demographics and citation information for a given year and police
district code and merges them into a common dataframe.

In [37]:
#Function: get the percentage of each race being cited and population demographics for that area
#Params: the year and area to get the citation percentages from
def get_citation_percent_v_population(year, area):
    
    #Getting the appropriate data
    df_temp = df_years[year][area][0]
    df_temp = df_temp[['W','H','A','B','O']].loc[df_temp.index == 'citation']
    df_new = pd.DataFrame(df_years[year][area][1]['Percent'])
    
    #Adding the citation data to the population percent dataframe
    #(DataFrame.set_value was removed in pandas 1.0, so label-based .loc assignment is used instead)
    for i in range(0,5):
        df_new.loc[races_full[i],'Citation'] = df_temp.iat[0,i]
    
    #Merging smaller racial groups with little data to Other
    df_new.loc['Other','Percent'] = df_new.iat[5,0] + df_new.iat[6,0] + df_new.iat[7,0] + df_new.iat[8,0]
    df_new = df_new.drop(['Two or More','American Indian','Pacific Islander','Total Population'])
    
    #Finding the total citation count and converting citation values into percentages
    total_val = df_new.iat[4,1] + df_new.iat[3,1] + df_new.iat[2,1] + df_new.iat[1,1] + df_new.iat[0,1]
    for i in range(0,5):
        df_new.loc[races_full[i],'Citation'] = df_new.iat[i,1]/total_val * 100
        
    return df_new
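
The merging and normalisation steps can also be written without position-based indexing. The sketch below is only an illustration on made-up numbers: the helper name `to_percent_table`, the `small_groups` list, and the toy inputs are hypothetical and are not part of the notebook's data. It assumes the population percentages are a Series indexed by race label.

```python
import pandas as pd

races_full = ['White', 'Hispanic', 'Asian', 'Black', 'Other']
small_groups = ['Two or More', 'American Indian', 'Pacific Islander']

def to_percent_table(pop_percent, citation_counts):
    """Hypothetical helper: combine population percentages and raw citation
    counts into one table with a 'Percent' and a 'Citation' column."""
    pop = pop_percent.copy()
    # Fold the small groups into 'Other' (treated as 0 if no 'Other' row exists)
    pop['Other'] = pop.get('Other', 0.0) + pop[small_groups].sum()
    pop = pop[races_full]
    # Convert raw citation counts into percentages of all citations
    counts = citation_counts[races_full]
    cit = counts / counts.sum() * 100
    return pd.DataFrame({'Percent': pop, 'Citation': cit})

# Made-up inputs, for illustration only
pop_percent = pd.Series({'White': 40.0, 'Hispanic': 30.0, 'Asian': 15.0,
                         'Black': 6.0, 'Other': 1.0, 'Two or More': 4.0,
                         'American Indian': 1.0, 'Pacific Islander': 3.0,
                         'Total Population': 100.0})
citation_counts = pd.Series({'White': 50, 'Hispanic': 25, 'Asian': 10,
                             'Black': 10, 'Other': 5})

table = to_percent_table(pop_percent, citation_counts)
print(table)
```

Selecting rows by label rather than by position means the result does not depend on the row order of the population table.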

This next function sums the citation percentages and population percentages for all of San Diego county across
all years we have data for. It also lets us compute average population demographics and citation rates across all of San Diego. We can then run a chi-squared test to check whether the citation percentages differ from the population demographics.

In [38]:
#Function: get citation counts and population percentages from all years and areas and sum their counts in a dataframe
def citation_for_all():
    
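    #Create a new empty dataframe to store our general info
    #(columns are given as an ordered list: the positional lookups below expect Population at 0 and Citation at 1)
    df_total = pd.DataFrame(0.0, index= ['White','Hispanic','Asian','Black','Other'], columns=['Population','Citation'])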
    
    #Loop through all the police areas, years, and races; get their citation percentages and population percentages
    for area in areas_with_data:
        for year in years_list:
            df_citate = get_citation_percent_v_population(year,area)
            
            #Sum the percentages in the total dataframe, add the separate values to the citation and population dataframes
            #(note: the per-area columns are overwritten on each pass, so they keep the last year's values)
            for i in range (0,5):
                df_total.loc[df_total.index[i],'Citation'] = df_citate.iat[i,1] + df_total.iat[i,1]
                citations.loc[citations.index[i],area] = df_citate.iat[i,1]
                df_total.loc[df_total.index[i],'Population'] = df_citate.iat[i,0] + df_total.iat[i,0]
                populations.loc[populations.index[i],area] = df_citate.iat[i,0]
                
    #Get the total percentage numbers
    total_val1 = df_total.iat[4,1] + df_total.iat[3,1] + df_total.iat[2,1] + df_total.iat[1,1] + df_total.iat[0,1]
    total_val2 = df_total.iat[4,0] + df_total.iat[3,0] + df_total.iat[2,0] + df_total.iat[1,0] + df_total.iat[0,0] 
    
    #Get the overall population and citation percentage of each race in San Diego
    for i in range(0,5):
        df_total.loc[races_full[i],'Citation'] = df_total.iat[i,1]/total_val1 * 100
        df_total.loc[races_full[i],'Population'] = df_total.iat[i,0]/total_val2 * 100
    
    #Create a column of % difference values between the overall citation and population percentages
    for i in range(0,5):
        df_total.loc[races_full[i],'Differences'] = df_total.iat[i,1] - df_total.iat[i,0]
    
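    #Do a chi-squared test to see how population and citation compare
    #(observed = citation percentages, expected = population percentages; the test is designed for
    #counts, so running it on percentages ignores how many citations were actually issued)
    q_val, p_val = stats.chisquare(f_obs=df_total['Citation'], f_exp=df_total['Population'], axis=None)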
    
    #Print the result of the chi-squared test
    if p_val < 0.01:
        print('There is a significant difference in percentages!')
    else:
        print('No significant difference in percentages overall.')
    
    return df_total
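
Because `scipy.stats.chisquare` is a goodness-of-fit test on frequencies, one way to keep the sample size in play is to test raw citation counts against the counts we would expect if citations followed the population shares. The sketch below uses made-up numbers; `observed_counts` and `population_pct` are hypothetical stand-ins for the real totals, listed in the same race order.

```python
import numpy as np
from scipy import stats

# Made-up citation counts and population shares (percent), same race order
observed_counts = np.array([480, 310, 90, 80, 40])
population_pct = np.array([45.0, 30.0, 15.0, 6.0, 4.0])

# Expected counts if citations were issued in proportion to population.
# Scaling to the observed total keeps both arrays summing to the same value,
# which the goodness-of-fit test assumes.
expected_counts = population_pct / population_pct.sum() * observed_counts.sum()

chi2, p_val = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(chi2, p_val)
```

With counts, the same percentage gap becomes more or less significant depending on how many citations were issued, which a test on percentages alone cannot capture.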

Having dealt with the data as a whole, we now want to make comparisons between individual races, using the
difference between the citation and population percentages for each race in each police district. The
between_races function creates a dataframe of these citation-population differences for all races and areas.

In [39]:
#Function: creates a dataframe of the differences between citation and population percentages for all areas
#Params: dataframe of citation percentages, dataframe of population percentages
def between_races(df_citations,df_populations):
    
    #Create an empty dataframe to store our difference values
    df_differences = pd.DataFrame(index=['White','Hispanic','Asian','Black','Other'], dtype=float)
    
    #Loop through the citations and populations dataframes and set the difference's values
    for indexO in range(0, len(areas_with_data)):
        for indexI in range(0, len(df_differences.index)):
            df_differences.loc[races_full[indexI],areas_with_data[indexO]] = (
                df_citations.iat[indexI,indexO] - df_populations.iat[indexI,indexO])
    
    return df_differences
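
When the two dataframes share the same row and column labels, the same table of differences can be obtained with a single label-aligned subtraction. A minimal sketch on made-up values (the `toy_` names and area labels are hypothetical):

```python
import pandas as pd

races_full = ['White', 'Hispanic', 'Asian', 'Black', 'Other']

# Made-up citation and population percentages for two hypothetical areas
toy_citations = pd.DataFrame({'Area A': [50.0, 25.0, 10.0, 10.0, 5.0],
                              'Area B': [30.0, 40.0, 10.0, 15.0, 5.0]},
                             index=races_full)
toy_populations = pd.DataFrame({'Area A': [40.0, 30.0, 15.0, 6.0, 9.0],
                                'Area B': [35.0, 35.0, 15.0, 10.0, 5.0]},
                               index=races_full)

# pandas aligns on row and column labels before subtracting
toy_differences = toy_citations - toy_populations
print(toy_differences)
```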

The next two functions let us compare the difference scores for every pair of races and determine whether
the pairs differ significantly, using chi-squared tests.

In [40]:
#Function: prints out the result of a chi-squared test between two race citation-population differences
#Params: the dataframe of differences, and names of two races of interest
def chi_results(df_differences, race1, race2):
    
    #Get the information for each race
    df_1 = df_differences.loc[df_differences.index == race1]
    df_2 = df_differences.loc[df_differences.index == race2]
    
    #Compute the chi-squared value for the dataframes values
    #(caution: chisquare assumes non-negative frequencies, while difference scores can be negative)
    q, p_val = stats.chisquare(df_1,df_2, axis=None)
    
    #Print out the names of the races if their difference is significant
    if p_val < 0.01:
        print(str(race1) + ' & ' + str(race2))

In [41]:
#Function: calls the chi_results method for all race-race comparisons
def chi_squared_comparisons():
    # Some print statements which will look nice
    print()
    print("But do there exist differences between individual races?")
    print("* These comparisons are made with matrices of difference scores from each district")
    print()
    #Compute chi-squared values for each of the following race pairs
    chi_results(differences,'Asian','Black')
    chi_results(differences,'Asian','Hispanic')
    chi_results(differences,'Asian','Other')
    chi_results(differences,'Asian','White')
    chi_results(differences,'Black','Hispanic')
    chi_results(differences,'Black','Other')
    chi_results(differences,'Black','White')
    chi_results(differences,'Hispanic','Other')
    chi_results(differences,'Hispanic','White')
    chi_results(differences,'White','Other')
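
The ten pairs listed above are every unordered pair of the five races, so they can also be generated rather than typed out. A small sketch using `itertools.combinations` (the variable names are illustrative):

```python
from itertools import combinations

races = ['Asian', 'Black', 'Hispanic', 'Other', 'White']

# Every unordered pair of races, in the order of the list above
race_pairs = list(combinations(races, 2))

for race1, race2 in race_pairs:
    print(race1, '&', race2)
```

Note that this yields `('Other', 'White')` where the cell above uses `'White','Other'`; since the chi-squared call treats its two arguments as observed and expected values, the order within a pair can affect the result.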

Finally, we can make the calculations, and see the results of the analyses.

In [42]:
#Get the general data
df_show = citation_for_all()

#Plot the general differences between citation percentages and population percentages
plt.bar(df_show.index,df_show['Differences'])
plt.title('% Differences between population and citation across Race')
plt.ylabel('% of total citations - total population %')

#Compute chi-squared tests on differences between races
chi_squared_comparisons()


According to this analysis, there is no significant difference between the overall population averages and citation
averages, which suggests an overall lack of bias in citation practices across races. This can be seen visually:
averaging the bars across the races gives a near-zero difference between population and citations. However, when each race is taken separately, a couple of significant differences are found between races (some of which are also consistent with the general data). For example, Asians and Blacks have a near 10% differential in citation rate in the general graph, similar to Asians & Others and Whites & Others. Other effects are found by the area-by-area difference calculation but do not hold generally across the San Diego population as a whole, which could be suggestive of local effects. On the flip side, some data appears to be significantly different visually on the general graph but fails to hold on an individual basis, suggesting that the general statistic may be skewed to some extent as well.

It is worth considering that part of these citation biases may be related to the role of San Diego as a
sanctuary city for refugees and a starting place for many immigrants in the U.S.A., where certain districts may have a
larger police presence due to new populations who are adjusting to our laws and language. One such area is City Heights,
a community made up mostly of immigrants, refugees, asylees, etc.; having been to that neighbourhood, it is very easy to notice a heightened police presence. So the effects we see may not be indicative of racial bias in policing but of other circumstances. In such communities, the police may serve as an educational institution through which newcomers, who may not know English or understand how to live in a different culture, learn the law of the land. Which brings us to privacy and ethics.