# Lab: Hate crimes

In session 7 (week 8) we discussed data and society: discourses on the social, political and ethical aspects of data science. We also discussed how to carry out data science research on social phenomena responsibly, which ethical and social frameworks can help us critically approach data science practices and their effects on society, and ethical practices for data scientists.

## Datasets 

This week we will work with the following datasets:

- **[Hate crimes](https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/)**
- **[World Bank Indicators Dataset](https://databank.worldbank.org/source/world-development-indicators#)**
- **Office for National Statistics (ONS)**: [Gender Pay Gap](https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/annualsurveyofhoursandearningsashegenderpaygaptables) 


### Further datasets
- **[OECD Poverty gap](https://data.oecd.org/inequality/poverty-rate.htm)**
- **Poverty & Equity Data Portal**: From [Organisation for Economic Co-operation and Development (OECD)](https://data.oecd.org/inequality/income-inequality.htm#indicator-chart) or from the [WorldBank](https://povertydata.worldbank.org/poverty/home/)
- **NHS**: multiple files. The NHS inequality challenge: <https://www.nuffieldtrust.org.uk/project/nhs-visual-data-challenge>
  - [Health state life expectancies by Index of Multiple Deprivation](https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthinequalities/) (IMD 2015 and IMD 2019): England, all ages (multiple publications)


### Additional Readings
- **Indicators - critical reviews**: The Poverty of Statistics and the Statistics of Poverty: https://www.tandfonline.com/doi/full/10.1080/01436590903321844?src=recsys
- **Indicators in global health**: indicators are usually comprehensible only to a small group of experts, so why use them? "Because indicators used in global HIV finance offer openings for engagement to promote accountability (...) some indicators and data truly are better than others, and as they were all created by humans, they all can be deconstructed and remade in other forms" Davis, S. (2020). The Uncounted: Politics of Data in Global Health, Cambridge. doi:10.1017/9781108649544

## Hate Crimes 

In this notebook we will be using the [Hate Crimes dataset from Fivethirtyeight](https://github.com/fivethirtyeight/data/tree/master/hate-crimes), which was used in the story [Higher Rates Of Hate Crimes Are Tied To Income Inequality](https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/).

### Variables:
| Header | Definition |
| --- | --- |
| `NAME` | State name |
| `median_household_income` | Median household income, 2016 |
 | `share_unemployed_seasonal` | Share of the population that is unemployed (seasonally adjusted), Sept. 2016 | 
 | `share_population_in_metro_areas` | Share of the population that lives in metropolitan areas, 2015 | 
 | `share_population_with_high_school_degree` | Share of adults 25 and older with a high-school degree, 2009 | 
 | `share_non_citizen` | Share of the population that are not U.S. citizens, 2015 | 
 | `share_white_poverty` | Share of white residents who are living in poverty, 2015 | 
 | `gini_index` | Gini Index, 2015 | 
 | `share_non_white` | Share of the population that is not white, 2015 | 
 | `share_voters_voted_trump` | Share of 2016 U.S. presidential voters who voted for Donald Trump | 
 | `hate_crimes_per_100k_splc` | Hate crimes per 100,000 population, Southern Poverty Law Center, Nov. 9-18, 2016 | 
 | `avg_hatecrimes_per_100k_fbi` | Average annual hate crimes per 100,000 population, FBI, 2010-2015 | 

::: aside

**Gini Index:** measures income inequality. `Gini Index` values can range between `0` and `1`, where 0 indicates perfect equality and everyone has the same income, and 1 indicates perfect inequality. You can read more about Gini Index here: <https://databank.worldbank.org/metadataglossary/world-development-indicators/series/SI.POV.GINI>

![Map of income inequality Gini coefficients by country (%). Based on World Bank data for 2019. Source: [Wikipedia](https://en.wikipedia.org/wiki/Gini_coefficient)](media/Gini_Coefficient_of_Wealth_Inequality_source_(2019).png){width=300px}
:::
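To make the 0-to-1 range of the Gini Index concrete, here is a minimal sketch that computes it for a short list of incomes. The function name `gini`, the rank-weighted formula and the income values are illustrative assumptions; this is not necessarily how the World Bank or the dataset authors calculate the published figures.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes (0 = perfect equality)."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted sum over the sorted incomes (ranks start at 1)
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up incomes for illustration only
equal = gini([30_000, 30_000, 30_000, 30_000])   # everyone earns the same
concentrated = gini([0, 0, 0, 120_000])          # one person earns everything
print(equal, concentrated)
```

With only four people the most unequal case gives 0.75 rather than 1; with this formula the maximum for `n` people is `(n - 1) / n`, which approaches 1 as the population grows.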

## Data exploration


::: callout-note

### Select the IM939 environment before you begin
We will again be using Altair and Geopandas this week. If you are using the course's virtual environment, these should have been installed when you first set up your environment for the module. Refer to @sec-setup for instructions on how to set up your environment.

:::

In [None]:
#| echo: false
# Remove warnings for this notebook.
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
df = pd.read_excel('data/hate_Crimes_v2.xlsx')

A reminder: anything with a `pd.` prefix comes from pandas (since `pd` is the alias we have created for the pandas library). This is particularly useful for preventing a module from overwriting inbuilt Python functionality.

Let's have a look at our dataset

In [None]:
# Retrieve the last ten rows of the df dataframe
df.tail(10)

In [None]:
# Is df indeed a DataFrame, let's do a quick check
type(df)

In [None]:
# What are the data types (Dtype) of the columns in df? Knowing them helps us manipulate the columns effectively
df.info()

### Missing values
Let's explore the dataset

The output above shows that we have some missing data for some of the states, as also shown by df.tail(10) earlier on. Let's check again. 

In [None]:
df.isna().sum()
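Counting missing values per column tells us where the gaps are, but not which rows carry them. The sketch below shows one way to list those rows. The `toy` DataFrame and its values are made up for illustration and are not taken from the lab dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "share_non_citizen": [0.02, np.nan, 0.05, np.nan],
    "gini_index": [0.45, 0.47, np.nan, 0.44],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that have at least one missing value in any column
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps["NAME"].tolist())
```

The same two lines should work on `df` if you want to see which states have incomplete records.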

Hmmm, the column 'share_non_citizen' does indeed have some missing data. How about the column NAME (state names). Is that looking ok?

In [None]:
import numpy as np
np.unique(df.NAME)

There aren't any unexpected values in 'NAME' for the USA

In [None]:
# And how many states do we have the data for?
count_states = df['NAME'].nunique()
print(count_states)

Oh...one extra state! Which one is it? And is it a state? Even if you don't get into investigating this immediately, if you realise that this entry is a particularly interesting one down the line in your analysis, you may wish to dig deeper into the context! 

## Mapping hate crime across the USA

In [None]:
# We need the geospatial polygons of the states in America
# You can remind yourself about shape polygons from last week's lab material too
import geopandas as gpd 
import altair as alt

# Read the geospatial data as a GeoDataFrame (gdf)
gdf = gpd.read_file('data/gz_2010_us_040_00_500k.json')
gdf.head()

In [None]:
# Confirm what type gdf is...
type(gdf)

As with the previous week, we have got a column called geometry that would allow us to project the data in a 2D map format. 

In [None]:
# Calling the Altair alias (alt) to help us create the map of USA - more technically speaking, 
# creating a Chart object using Altair with the following properties
alt.Chart(gdf, title='US states').mark_geoshape().encode(
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

In [None]:
# Add the data
# You might recall that the df DataFrame and the gdf GeoDataFrame both have a NAME column
# Revisit the merge concept (from last week and earlier weeks) to refresh your memory

geo_states = gdf.merge(df, on='NAME')

geo_states.head()


In [None]:
# Let's take a look at the merged GeoDataFrame
geo_states.head(10)
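A merge on a shared column keeps, by default, only the rows whose key appears in both tables, so names that exist on one side only disappear silently. The sketch below uses two small made-up pandas DataFrames (`shapes` and `stats` are hypothetical stand-ins, not the lab's `gdf` and `df`) to show how the `indicator` option of `merge` can reveal such unmatched names.

```python
import pandas as pd

# Hypothetical stand-ins for a table of shapes and a table of statistics
shapes = pd.DataFrame({
    "NAME": ["Alpha", "Beta", "Gamma"],
    "geometry": ["poly1", "poly2", "poly3"],
})
stats = pd.DataFrame({
    "NAME": ["Alpha", "Beta", "Delta"],
    "rate": [1.2, 0.4, 2.0],
})

# Default merge is an inner join: only names present in both tables survive
inner = shapes.merge(stats, on="NAME")

# An outer join with indicator=True adds a _merge column saying where each row came from
check = shapes.merge(stats, on="NAME", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["NAME", "_merge"]]

print(len(inner))
print(unmatched)
```

Running a similar check before plotting can explain why a region is missing from a map.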

Let's start making our visualisations and see if we can spot any trends or patterns

In [None]:
# Let's first check how hate crimes looked pre-election 
chart_pre_election = alt.Chart(geo_states, title='PRE-election Hate crime per 100k').mark_geoshape().encode(
    color='avg_hatecrimes_per_100k_fbi',
    tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
chart_pre_election

As you can see above, Altair has chosen a color for each US state based on the range of values in the `avg_hatecrimes_per_100k_fbi` column. We have also created a tooltip, so hover over the map and check the crime rates. Which ones are particularly high? average? low? 

Also, did you notice the dark spot between Virginia and Maryland? What's happening there? Remember: context always matters for data analysis!

In [None]:
# Ok, what about the post election status?
chart_post_election = alt.Chart(geo_states, title='POST-election Hate crime per 100k').mark_geoshape().encode(
    color='hate_crimes_per_100k_splc',
    tooltip=['NAME', 'hate_crimes_per_100k_splc']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
chart_post_election

In [None]:
#| column: page

# Perhaps we can arrange the maps side-by-side to compare better?
pre_and_post_map = chart_pre_election | chart_post_election
pre_and_post_map

::: callout-note

Oh, what's happening here? We better investigate:

1. Identify why the maps (particularly one of them) look so different now. Go back to the original maps to check. Go back to the description of the variables to check. Also, visit the source of the article.

2. Once you have identified the issue, can you find a way to address the issue?

:::

### Exploring data

In [None]:
#| column: page
import seaborn as sns
sns.pairplot(data = df.iloc[:,1:])

The above plot may be hard to read without squinting (and may take a bit longer to run on some devices), but it's definitely worth a closer look if you are able to. Check the histograms along the diagonal: what do they show about the distribution of each variable? For example, what does the `gini_index` distribution tell us? With respect to the scatter plots, some look more random while others show likely positive or negative correlations. You may wish to investigate what's happening! And, you might remember (as we also discussed in the video recordings this week), correlation != causation!

In [None]:
# Let's take a look at the income range in the country
df.boxplot(column=['median_household_income'])

In [None]:
# And the average hatecrimes based on FBI data next (also, average over what? check the Variables description again to remind you if need be)
df.boxplot(column=['avg_hatecrimes_per_100k_fbi'])

We may want to drop some states (remove them). See more [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html). 

Let us drop Hawaii (which is one of the states outside mainland USA)

In [None]:
# Let's find out the index value of the state in the DataFrame df
df[df.NAME == 'Hawaii']

In [None]:
# Let's look at a summary of numeric columns prior to dropping Hawaii
df.describe()

In [None]:
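# Let's now drop Hawaii, using the index label found above
# (looking the label up by name keeps this cell safe to re-run)
df = df.drop(df[df.NAME == 'Hawaii'].index)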

In [None]:
# Now check again for the statistical summary
df.describe()

There seem to be some changes.

In [None]:
# Let's dig in deeper to the correlation between median household income and hatecrimes based on FBI data
df.plot(x = 'avg_hatecrimes_per_100k_fbi', y = 'median_household_income', kind='scatter')

In [None]:
# And the relationship between median household income and hatecrimes based on SPLC data
df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')

Hmmm, there doesn't appear to be a strong (linear!) correlation, but there is certainly a cluster and some outliers! That's our cue: let's find out which states might be outliers by flagging those whose SPLC rate is greater than 2.5 times the column's standard deviation (computed with `np.std`).

In [None]:
df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]

Remember that we discussed 'context' earlier when we got 51 states? If your investigation paid off there, you can make better sense of the outliers here.
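The filter used in this section compares each value directly with 2.5 times the standard deviation. A common alternative convention measures how far each value sits from the mean, in units of standard deviation (a z-score). The sketch below contrasts the two rules on made-up numbers; the array `rates` and the cut-off `k` are assumptions for illustration, not values from the lab dataset.

```python
import numpy as np

# Made-up values for illustration; only the last one is unusual
rates = np.array([5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 6.0])

mean = rates.mean()
std = rates.std()   # population standard deviation (ddof=0), as np.std uses by default
k = 2.5

# Rule 1: value greater than k standard deviations (ignores the mean)
above_k_std = rates[rates > k * std]

# Rule 2: value more than k standard deviations above the mean
z_scores = (rates - mean) / std
above_mean_plus_k = rates[z_scores > k]

print(len(above_k_std), above_mean_plus_k)
```

On these numbers the first rule flags every value, because all of them are large compared with the spread, while the second rule flags only the value that is far from the rest. Which rule is appropriate depends on the question being asked, so it is worth stating the rule alongside any list of outliers.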

In [None]:
# Let's try to make the outliers more obvious
import matplotlib.pyplot as plt
outliers_df = df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
df.plot(x = 'hate_crimes_per_100k_splc', y = 'median_household_income', kind='scatter')
plt.scatter(outliers_df.hate_crimes_per_100k_splc, outliers_df.median_household_income ,c='red')

In [None]:
# Let's create a pivot table to focus on specific columns of interest
df_pivot = df.pivot_table(index=['NAME'], values=['hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi', 'median_household_income'])
df_pivot

In [None]:
# the pivot table seems sorted by state names, let's sort by FBI hate crime data instead 
df_pivot.sort_values(by=['avg_hatecrimes_per_100k_fbi'], ascending=False)
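As a reminder of what these two calls do, here is a small sketch on a made-up table. The column names `rate_fbi`, `rate_splc` and `other` are hypothetical. Two details are worth noticing: `pivot_table` aggregates with the mean by default, and `sort_values` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "NAME": ["A", "B", "C"],
    "rate_fbi": [2.0, 3.5, 1.2],
    "rate_splc": [0.2, 0.1, 0.4],
    "other": [10, 20, 30],
})

# One row per NAME, keeping only the selected value columns (default aggfunc is the mean)
pivot = toy.pivot_table(index="NAME", values=["rate_fbi", "rate_splc"])

# sort_values returns a sorted copy; pivot itself keeps its original order
ranked = pivot.sort_values(by="rate_fbi", ascending=False)

print(ranked)
```

If you want to keep the sorted table for later cells, assign the result to a variable as done with `ranked` here.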

In [None]:
# Let's standardise our data before we attempt further modelling using the data
from sklearn import preprocessing
import numpy as np

# Get the column names first
df_selected_std = df[['median_household_income','share_unemployed_seasonal', 'share_population_in_metro_areas'
               , 'share_population_with_high_school_degree', 'share_non_citizen', 'share_white_poverty', 'gini_index'
               , 'share_non_white', 'share_voters_voted_trump', 'hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi']]
names = df_selected_std.columns

# Create the Scaler object for standardising the data
scaler = preprocessing.StandardScaler()

# Fit our data on the scaler object
df2 = scaler.fit_transform(df_selected_std)

# check what type is df2
type(df2)

In [None]:
# Let's convert the numpy array into a DataFrame before further processing
df2 = pd.DataFrame(df2, columns=names)
df2.tail(10)
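It can help to see what standardising does to the numbers. `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The NumPy sketch below reproduces that arithmetic by hand on made-up values; the array `raw` is an assumption for illustration only.

```python
import numpy as np

# Made-up values on very different scales (for example an income and a share)
raw = np.array([
    [45_000.0, 0.10],
    [52_000.0, 0.14],
    [61_000.0, 0.09],
    [70_000.0, 0.17],
])

# z-score each column: subtract the column mean, divide by the column's
# population standard deviation (ddof=0)
means = raw.mean(axis=0)
stds = raw.std(axis=0)
standardised = (raw - means) / stds

print(standardised.round(2))
print(standardised.mean(axis=0).round(2), standardised.std(axis=0).round(2))
```

After the transformation every column has mean 0 and standard deviation 1, so columns measured in dollars and columns measured as shares can be compared on the same axis.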

In [None]:
#| column: page-inset-right

# Now that our data has been standardised, let's look at the distribution across all columns
ax = sns.boxplot(data=df2, orient="h")

In [None]:
import scipy.stats
# Let's create a correlation matrix by computing the pairwise correlation of numerical columns, rounding correlation values to two places
corrMatrix = df2.corr(numeric_only=True).round(2)
print (corrMatrix)

::: callout-warning

### Time for reflection:

Look at the positive and negative correlation values above. What do they suggest, and how strong, moderate or weak is each correlation?

:::
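By default, `DataFrame.corr` computes the Pearson correlation coefficient. The sketch below calculates it by hand for two made-up series so you can see where the sign and size of the value come from. The helper `pearson_r` and the arrays are assumptions for illustration.

```python
import numpy as np

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # tends to rise with x
y_down = -y_up                               # tends to fall with x


def pearson_r(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    a_centred = a - a.mean()
    b_centred = b - b.mean()
    numerator = (a_centred * b_centred).sum()
    denominator = np.sqrt((a_centred ** 2).sum() * (b_centred ** 2).sum())
    return float(numerator / denominator)


r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 3), round(r_down, 3))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. The coefficient only captures linear association, and it says nothing about causation.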

In [None]:
# Let's create a heatmap to visualise the pairwise correlations for better understanding
corrMatrix = df2.corr(numeric_only=True).round(1)  #Rounding to (1) so it's easier to read given number of variables
sns.heatmap(corrMatrix, annot=True)
plt.show()

In [None]:
# Let's now perform a linear regression on our data
# Try the commented code after you run this first
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# identify the independent variables
x = df2[['median_household_income', 'share_population_with_high_school_degree', 'share_voters_voted_trump']]
# identify the dependent variable
y = df2[['avg_hatecrimes_per_100k_fbi']]
# What if we change the data (y variable)
#y = df2[['hate_crimes_per_100k_splc']]

lin_model = LinearRegression(fit_intercept = True) 
lin_model.fit(x, y)

print("Coefficients:", lin_model.coef_)
print ("Intercept:", lin_model.intercept_)

# Generate predictions from our linear regression model, and check the MSE, Rsquared and Variance measures to assess performance
y_hat = lin_model.predict(x)
print ("MSE:", metrics.mean_squared_error(y, y_hat))
print ("R^2:", metrics.r2_score(y, y_hat))
print ("var:", y.var())

What do these values suggest?
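One thing that may help when reading these numbers: when both the predictors and the response have been standardised (and the response has no missing values), the fitted intercept is essentially zero and the MSE equals `1 - R^2`. The sketch below illustrates this with synthetic data and a plain NumPy least-squares fit; the data, the coefficients 0.6 and -0.3, and the variable names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic predictors and response, for illustration only
X = rng.normal(size=(n, 2))
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardise predictors and response (population std, ddof=0)
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()

# Ordinary least squares with an explicit intercept column
design = np.column_stack([np.ones(n), Xz])
coef, *_ = np.linalg.lstsq(design, yz, rcond=None)
y_hat = design @ coef

mse = np.mean((yz - y_hat) ** 2)
r2 = 1 - np.sum((yz - y_hat) ** 2) / np.sum((yz - yz.mean()) ** 2)

print("Intercept:", round(float(coef[0]), 6))
print("MSE:", round(float(mse), 4), "1 - R^2:", round(float(1 - r2), 4))
```

Note also that pandas' `var()` uses the sample variance (`ddof=1`) by default, so on a column standardised with the population standard deviation it prints a value slightly above 1.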

::: callout-note

Remember the earlier note about the maps looking different? Were you able to identify what's going on there?

Let's revisit that part again

:::

The maps did not share comparable scales. The first map showed the annual crime rate per 100k residents, while the second map showed the total number of incidents per 100k residents for only the 10 days following the 2016 election. How can we fix this discrepancy?
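The general idea is to express both measures over the same unit of time before comparing them. Here is a minimal sketch of that idea using hypothetical figures and window lengths (a 30-day and a 7-day window, deliberately different from this dataset); the helper name `per_day` is also an assumption.

```python
def per_day(rate_per_100k, window_days):
    """Convert a rate accumulated over `window_days` into a per-day rate."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return rate_per_100k / window_days


# Hypothetical figures: one source covers a 30-day window, another a 7-day window
monthly_source = per_day(0.9, 30)
weekly_source = per_day(0.35, 7)

print(round(monthly_source, 3), round(weekly_source, 3))
```

Before conversion the first figure looks larger (0.9 against 0.35); once both are per day, the second one is higher. The same reasoning applies to the two hate crime columns once you work out how many days each one covers.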

::: callout-warning

### Your Turn

Tweak the `??`s below to visualise changes

:::

In [None]:
#| error: true

# We can generate two new "features" and add them to the DataFrame df
df['hate_crimes_per_100k_splc_perday'] = df['hate_crimes_per_100k_splc'] / ??
# The 'avg_hatecrimes_per_100k_fbi' column is an annual incidence average for 2010-15, so each value is the number of incidents (per 100k residents) in an average year.
df['avg_hatecrimes_per_100k_fbi_perday'] = df['avg_hatecrimes_per_100k_fbi'] / ???

In [None]:
#| error: true

# Update geo_states
geo_states = geo_states.merge(df, on='????')

In [None]:
# Let's plot again
# First the PRE election map
pre_election_map = alt.Chart(geo_states, title='PRE-election Hate crime per 100k per day').mark_geoshape().encode(
    alt.Color('avg_hatecrimes_per_100k_fbi_perday', scale=alt.Scale(domain=[0, 0.15])),
    tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi_perday']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

post_election_map = alt.Chart(geo_states, title='POST-election Hate crime per 100k per day').mark_geoshape().encode(
    alt.Color('hate_crimes_per_100k_splc_perday', scale=alt.Scale(domain=[0, 0.15])),
    tooltip=['NAME', 'hate_crimes_per_100k_splc_perday']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

new_combined_map = pre_election_map | post_election_map
new_combined_map

How is that now?