# Data Science for Social Justice Workshop: Module 1

## DataFrames in Python

In the previous notebook, we learned about the fundamentals of Python. Now, we'll learn about DataFrames, a data structure that facilitates working with **tabular data**, or data you might expect to see in Excel. Tabular data has many rows (data samples) and columns describing different aspects of those samples.

This notebook is designed to help you:

* Load data from a `csv` file into a `pandas` dataframe.
* Query data from a dataframe based on a condition.
* Plot your findings using Seaborn.

## Manipulating DataFrames with `pandas`

`pandas` is a Python package that allows us to more easily perform complicated analysis using datasets. Think of it like a spreadsheet tool, but for programming in Python. We will use two additional packages: `numpy`, which is built for performing fast operations on arrays (effectively, more flexible lists) and `seaborn`, a visualization library for Python.

First let's import `pandas`, `numpy` and `seaborn`:

In [None]:
import pandas as pd    # pandas is conventionally shortened to pd
import numpy as np     # numpy provides useful array manipulations
import seaborn as sns  # seaborn lets us draw graphs easily
sns.set_theme()

Now, let's create a **dataframe**. Think of a dataframe like a table with columns that contain different features of each sample.

Pandas makes opening a `csv` file and seeing a summary of the data easy. We use the `read_csv()` function for this.

**Remember: you need to give the correct path (i.e., a roadmap on your filesystem) to the file you are trying to read.** If the `csv` file is in the same folder as this `ipynb` file, then you just have to give the filename.

We'll store our dataframe as `df`, which is the conventional name for a dataframe. You can call it whatever you want, but keep it short and informative!

In [None]:
import os
# We include two ../ because we want to go two levels up in the file structure
os.chdir('../../data')
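To get a feel for what `read_csv()` does without depending on a file on disk, here is a minimal sketch that reads CSV text held in memory. The rows and the name `toy_df` are made up for illustration and are not taken from any of the workshop datasets.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not real workshop data)
csv_text = """Name,Age,Fare
Alice,29,7.25
Bob,41,71.28
Carla,35,8.05
"""

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names come from the first line
```

With a real file, you would pass the path as a string instead of the `io.StringIO` object.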

## Coding for Social Justice

Even if you’re new to programming, the ability to collect and analyze data independently means you no longer have to take people in power at their word when they tell you what “the data” says. You can find data, evaluate it, analyze it, and present your own findings. This will help you decide who you can trust, and help you find a voice in speaking truth to power.

Working through these kinds of exercises early on can also help you think about how you might use what you’re learning about programming to address systemic issues at a personal, local, national, and international level.

One of the most famous datasets that allows us to start thinking about social justice is the Titanic dataset. It contains information about passengers aboard the RMS Titanic, which unfortunately was shipwrecked. This dataset can be used to predict whether a given passenger survived or not. Specifically, the Titanic dataset allows us to explore the impact of a number of demographic attributes, such as socioeconomic status, gender, and age, on the likelihood of surviving the wreck.

As we go through this, I want you to reflect on one particularity of working with data about terrifying events: the way the events can come to seem "distant". These abstracted datapoints allow us to pursue all kinds of hypotheses about why passengers survived the Titanic's disaster – but what remains buried under all this analysis is the voices of these people.

In [None]:
df = pd.read_csv('titanic.csv')
df.head()

Notice how it printed this without us even needing to use the `print()` function!

The `head()` method shows the first 5 rows. You can ask it to show more than 5 by passing it an integer:

In [None]:
df.head(10)

The `tail()` method works the same way, but starts from the bottom instead of the top.

### Exercise 1

Using the `tail()` method, print the bottom three rows.

In [None]:
# YOUR CODE HERE


When working with dataframes that have loads of columns, `pandas` will hide some of the columns. Remember that you can print out the names of all the columns using the `.columns` attribute:

In [None]:
df.columns

We select columns in dataframes using bracket notation: a single column name for one column, or a list of names for several columns.

In [None]:
# Select one column
df['Name'].head()

In [None]:
# Select multiple columns
df[['Name', 'Pclass']].head()
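The two kinds of selection return different types of object, which matters for what you can do next. The sketch below uses a small made-up dataframe (`toy_df` and its rows are hypothetical) to show the difference.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carla"],
    "Pclass": [1, 3, 2],
    "Age": [29, 41, 35],
})

one_column = toy_df["Name"]               # a single name gives a Series
two_columns = toy_df[["Name", "Pclass"]]  # a list of names gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

A Series is a single labelled column of values, while a DataFrame is a table that can hold several columns.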

`pandas` also makes it easy to select specific rows in a dataframe using a condition. For example, if we want to select people above the age of 60, we can use the conditional selection below. You can also use `==`, `!=`, `<`, `<=`, `>` or `>=` for other comparisons, and combine several conditions with `&` (and) or `|` (or), wrapping each condition in parentheses.

In [None]:
# Conditional selection
df[df['Age'] > 60]
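Combining conditions is a common stumbling block, so here is a short sketch on made-up rows (`toy_df` is hypothetical). Each condition goes in its own parentheses, because `&` and `|` bind more tightly than comparisons such as `>`.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carla", "Dev"],
    "Sex": ["female", "male", "female", "male"],
    "Age": [62, 70, 35, 18],
})

# Both conditions must hold (and)
older_women = toy_df[(toy_df["Age"] > 60) & (toy_df["Sex"] == "female")]

# At least one condition must hold (or)
young_or_old = toy_df[(toy_df["Age"] < 20) | (toy_df["Age"] > 65)]

print(older_women)
print(young_or_old)
```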

That's nice, but it'd be better if the rows were ordered according to age. We can use the `.sort_values()` method to sort the rows by a particular column. The arguments it takes are the name of the column (as a string), and `ascending=False` if you'd like the rows in descending order:

In [None]:
df[df['Age'] > 60].sort_values("Age", ascending=False)
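`sort_values()` can also sort by more than one column, and it returns a new dataframe rather than changing the original. The following is a sketch on made-up rows (`toy_df` is hypothetical).

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carla", "Dev"],
    "Pclass": [3, 1, 3, 1],
    "Age": [62, 70, 35, 18],
})

# Sort by class (ascending), then by age within each class (descending)
sorted_df = toy_df.sort_values(["Pclass", "Age"], ascending=[True, False])

print(sorted_df)
print(toy_df)  # the original row order is unchanged
```

If you want to keep the sorted version, assign the result to a variable as done here.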

### Exercise 2

Using the `sort_values()` method, sort the fare paid in descending order.

In [None]:
# YOUR CODE HERE


## Grouping and Aggregating Data

A common operation with dataframes consists of 1) grouping samples by a column and 2) aggregating the samples by group into a useful statistic.

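Using the `groupby()` method, we can group our data by passenger class. We can then aggregate each group using the `mean()` method to get the mean survival rate of each class. Since `Survived` is recorded as 0 or 1, the mean of that column is the fraction of passengers in the group who survived.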

In [None]:
df[['Pclass', 'Survived']].groupby(['Pclass']).mean()
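A mean on its own hides how many people are in each group. As a sketch, assuming a 0/1 `Survived` column like the one above, we can ask for several statistics at once with `agg()`. The rows in `toy_df` are made up.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# mean = survival rate, sum = number of survivors, count = group size
summary = toy_df.groupby("Pclass")["Survived"].agg(["mean", "sum", "count"])
print(summary)
```

Reporting the group size next to the rate helps readers judge how much weight a rate deserves.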

Let's also include details about the passengers' sex.

Remember: datasets are constructed by humans, and thus reflect their own biases, judgements, and norms. This dataset makes no distinction between sex assignment and gender identity, and further only reflects a gender binary.

In [None]:
df[['Sex', 'Pclass', 'Survived']].groupby(['Pclass', 'Sex']).mean()

It turns out the sex of the passenger and the class in which they travelled (Pclass) seem to correlate strongly with survival.

## Data Visualisation

Next, let's visualise our dataframes using the `seaborn` library. To improve your skills, you'll want to read `seaborn`'s [documentation](https://seaborn.pydata.org) yourself.

`seaborn` has three basic figure-level plotting functions: `relplot()`, `displot()` and `catplot()`. Each of them has a number of related functions, which are basically shorthands for the main functions.

<img src="https://seaborn.pydata.org/_images/function_overview_8_0.png">

### Categorical Plots
Let's begin with a categorical plot, a one-dimensional graph that lets us place a categorical (non-numerical) value on the x-axis. Let's also add information about whether a passenger survived using `hue=`.

What happens when you remove the `hue=` argument? What information is it providing?

In [None]:
sns.catplot(
    x="Sex",
    hue="Survived", 
    kind="count",
    data=df)

Just by observing the graph, we can estimate that the survival rate of men is around 20% and that of women is around 75%. Whether a passenger was recorded as male or female therefore correlates with survival.
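Rather than reading rates off bar heights, we can compute them. This sketch uses `pd.crosstab()` with `normalize="index"` so that each row sums to 1. The rows in `toy_df` are made up, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 0],
})

# Rows: Sex, columns: Survived (0 or 1), values: share within each row
rates = pd.crosstab(toy_df["Sex"], toy_df["Survived"], normalize="index")
print(rates)
```

The column labelled `1` holds the survival rate for each row.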

### Heatmaps

Using `seaborn`, we can also create a heatmap – a plot that shows how a continuous quantity varies as a function of two categorical quantities. This time, we'll look at the counts of passenger class versus survival:

In [None]:
group = df.groupby(['Pclass', 'Survived'])
pclass_survived = group.size().unstack()
 
# Heatmap - Color encoded 2D representation of data.
sns.heatmap(pclass_survived, annot=True, fmt="d")

This heatmap helps us see whether higher-class passengers survived at a higher rate than lower-class passengers, or vice versa. Because the cells show counts rather than rates, compare the two cells within each row. Class 1 passengers have a higher survival chance compared to classes 2 and 3. This suggests that `Pclass` is strongly associated with a passenger’s chance of survival.
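To see what `size().unstack()` produces, and how counts can be turned into within-class rates, here is a sketch on made-up rows (`toy_df` is hypothetical). Dividing each row by its total is one possible way to make classes of different sizes comparable.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# One row per class, one column per Survived value, cells are counts
counts = toy_df.groupby(["Pclass", "Survived"]).size().unstack(fill_value=0)

# Divide each row by its total to get the share within each class
rates = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(rates)
```

A table like `rates` could be passed to `sns.heatmap()` in place of the counts, with `fmt=".2f"` instead of `fmt="d"`.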

### Violinplot

Next, let's plot whether sex and age relate to survival rate with a violin plot. The violin plot allows us to view the distribution of a continuous variable:

In [None]:
sns.violinplot(
    x="Sex",
    y="Age",
    hue="Survived",
    split=True,
    data=df)

This graph gives a summary of the age range of people who were saved. The survival rate appears to be:

- Higher for children.
- High for women in the age range 20-50.
- Lower for men as age increases.

### Barplot

Finally, let's use a barplot to have a look at whether the fare paid influences chances of survival:

In [None]:
# What does this function do? Take a look at the documentation!
df['Fare_Range'] = pd.qcut(df['Fare'], 4)
 
# Barplot - Shows approximate values based on the height of bars
sns.barplot(
    x='Fare_Range',
    y='Survived',
    data=df)

`Fare` denotes the fare paid by a passenger. As the values in this column are continuous, they need to be grouped into separate bins to get a clear picture. The plot suggests that passengers who paid a higher fare survived at a higher rate.
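As a sketch of how the binning step behaves, compare `pd.qcut()`, which aims for bins holding equal numbers of values, with `pd.cut()`, which makes bins of equal width. The fares below are made up, and the variable names are only for illustration.

```python
import pandas as pd

# Made-up fares, deliberately skewed towards small values
fares = pd.Series([5, 7, 8, 10, 15, 30, 80, 250])

# qcut: four quantile-based bins, each holding about the same number of fares
qcut_counts = pd.qcut(fares, 4).value_counts().sort_index()

# cut: four equal-width bins, so most fares land in the first bin
cut_counts = pd.cut(fares, 4).value_counts().sort_index()

print(qcut_counts)
print(cut_counts)
```

With skewed data like fares, quantile bins keep every bar in the plot backed by a similar number of passengers.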

### Exercise 3

Create a `scatterplot` to visualize the age against the fare paid.

In [None]:
# YOUR CODE HERE


## Correlations

It's not enough to just use visualization to describe our findings in data analysis. Visualizations merely *illustrate* an argument which we make with numbers.

Can we find a correlation between age and survival? We can calculate a correlation using the `corr()` method:

In [None]:
df['Survived'].corr(df['Age'])

In [None]:
df['Sex']

Next, let's look at sex and survival rate. We're going to create a new column that indicates whether the passenger was recorded as male, and find the correlation between that column and survival:

In [None]:
df['male'] = df['Sex'] == 'male'
df['male'].corr(df['Survived'])

So it turns out that a passenger's being a man was strongly negatively correlated with his survival aboard the Titanic, and a passenger's being older was very weakly negatively correlated with survival.
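To make the number returned by `corr()` less of a black box, the sketch below computes the default (Pearson) correlation by hand with `numpy` and compares it to the `pandas` result. The rows are made up, and the indicator column is cast to integers here as a cautious choice.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})
toy_df["male"] = (toy_df["Sex"] == "male").astype(int)

# pandas: Pearson correlation is the default for Series.corr
r_pandas = toy_df["male"].corr(toy_df["Survived"])

# By hand: sum of products of deviations, scaled by both spreads
x = toy_df["male"].to_numpy(dtype=float)
y = toy_df["Survived"].to_numpy(dtype=float)
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

A value near -1 or 1 indicates a strong linear relationship, and a value near 0 a weak one. The sign shows the direction.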

# Police Shootings Database

<font color=red>**Content warning.** The datasets in the following part of the notebook refer to people killed by police. This is highly sensitive data and you should only engage with it if you feel you are able to.</font>

Retrieving information from community-based data can be, and often is, an extractive exercise, where not enough time is spent thinking about how to give back to the community that the data is taken from. This is something to always consider when beginning a data science project. Science is never "objective" in the sense of being free from our politics or values: as scientists, we bring them into our work. Our activism should go hand in hand with our work, since our values drive the analyses we do, and so we must think critically about which data we use, what analyses we apply to it, and whether we should even be attempting to answer questions with that data.

At the same time, we must understand the limitations of our data, and the historical power structures that generated its patterns. For example, the inclusion of certain features can change our perception of the data. It is important to think through which kinds of features could matter when trying to answer a question.

With this in mind, we turn to a police shooting database. The Washington Post maintains a database containing records of every fatal shooting in the United States by a police officer in the line of duty since Jan. 1, 2015.
See [here](https://www.washingtonpost.com/graphics/investigations/police-shootings-database/) for more information. 

This dataset contains valuable information that could be used to hold powerful institutions accountable. At the same time, sloppy or incomplete analyses can hurt communities we aim to support. Our values, and support for Black lives, should drive how we use this data.

We briefly introduce this dataset, and leave you to think critically about what questions you can answer with it.

In [None]:
df = pd.read_csv('fatal-police-shootings-data.csv')
df.head()

Each row has the following variables:

- `id`: A unique identifier for each victim.
- `name`: The name of the victim.
- `date`: The date of the fatal shooting in YYYY-MM-DD format.
- `manner_of_death`:
    - shot
    - shot and tasered
- `armed`: Indicates that the victim was armed with some sort of implement that a police officer believed could inflict harm.
    - undetermined: it is not known whether or not the victim had a weapon
    - unknown: the victim was armed, but it is not known what the object was
    - unarmed: the victim was not armed
- `age`: The age of the victim.
- `gender`: The gender of the victim. The Post identifies victims by the gender they identify with if reports indicate that it differs from their biological sex.
    - M: Male
    - F: Female
    - None: unknown
- `race`:
    - W: White, non-Hispanic
    - B: Black, non-Hispanic
    - A: Asian
    - N: Native American
    - H: Hispanic
    - O: Other
    - None: unknown
- `city`: The municipality where the fatal shooting took place. Note that in some cases this field may contain a county name if a more specific municipality is unavailable or unknown.
- `state`: Two-letter postal code abbreviation.
- `signs_of_mental_illness`: News reports have indicated the victim had a history of mental health issues, expressed suicidal intentions or was experiencing mental distress at the time of the shooting.
- `threat_level`: The threat_level column was used to flag incidents for the [story](http://www.washingtonpost.com/sf/investigative/2015/10/24/on-duty-under-fire/) by Amy Brittain in October 2015. As described in the story, the general criteria for the attack label was that there was the most direct and immediate threat to life. That would include incidents where officers or others were shot at, threatened with a gun, attacked with other weapons or physical force, etc. The attack category is meant to flag the highest level of threat. The other and undetermined categories represent all remaining cases. Other includes many incidents where officers or others faced significant threats.
- `flee`: News reports have indicated the victim was moving away from officers:
    - Foot
    - Car
    - Not fleeing
The threat column and the fleeing column are not necessarily related. For example, there is an incident in which the suspect is fleeing and at the same time turns to fire a gun at the officer. Also, attacks represent a status immediately before fatal shots by police, while fleeing could begin slightly earlier and involve a chase.
- `body_camera`: News reports have indicated an officer was wearing a body camera and it may have recorded some portion of the incident.
- `latitude_and_longitude`: The location of the shooting expressed as WGS84 coordinates, geocoded from addresses. The coordinates are rounded to 3 decimal places, meaning they have a precision of about 80-100 meters within the contiguous U.S.
- `is_geocoding_exact`: Reflects the accuracy of the coordinates. true means that the coordinates are for the location of the shooting (within approximately 100 meters), while false means that coordinates are for the centroid of a larger region, such as the city or county where the shooting happened.
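Two details in the list above affect how we work with the columns: dates arrive as text in YYYY-MM-DD format, and unknown values show up as missing. The sketch below shows one way to handle both. The rows are made up and are not real records.

```python
import pandas as pd

# Made-up rows for illustration only, not real records
toy_df = pd.DataFrame({
    "date": ["2015-01-02", "2015-06-30", "2016-03-15"],
    "race": ["W", None, "B"],
})

# Convert the text dates so that we can work with years, months, etc.
toy_df["date"] = pd.to_datetime(toy_df["date"], format="%Y-%m-%d")
toy_df["year"] = toy_df["date"].dt.year

# Count values including missing ones, which are dropped by default
race_counts = toy_df["race"].value_counts(dropna=False)
n_unknown = int(toy_df["race"].isna().sum())

print(toy_df)
print(race_counts)
```

Keeping missing values visible matters here, because leaving out unknown entries silently changes every proportion computed afterwards.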

Let's first examine the distribution of police killings by race and gender:

In [None]:
sns.countplot(x="race", hue="gender", data=df)

Let's have a look at the mean age and signs of mental illness. The latter is stored as boolean `True`/`False` values, which we can average as well: the mean of a boolean column is the share of rows that are `True`.

In [None]:
df[['age','race', 'signs_of_mental_illness']].groupby(['race']).mean()

It appears signs of mental illness were most often reported among white and Asian victims, but the relative values are somewhat close.
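Group means are easier to interpret next to group sizes, since a share based on a handful of rows is much less stable than one based on thousands. The sketch below uses made-up rows and a hypothetical `group` column standing in for `race`.

```python
import pandas as pd

# Made-up rows for illustration only; "group" is a hypothetical stand-in
toy_df = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y", "y"],
    "age":   [30, 40, 25, 35, 45, 55],
    "signs_of_mental_illness": [True, False, False, False, True, False],
})

# Named aggregation: one output column per (input column, statistic) pair
summary = toy_df.groupby("group").agg(
    n=("age", "count"),
    mean_age=("age", "mean"),
    share_flagged=("signs_of_mental_illness", "mean"),
)
print(summary)
```

Note that `count` ignores missing ages, so `n` is the number of rows with a recorded age in each group.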

What other questions can you ask and answer with this dataset? What questions do you want answered but **cannot** answer with this dataset? What are the limitations of this dataset?

## Another Dataset: Mapping Police Violence

Finally, let's have a quick look at Mapping Police Violence, another comprehensive accounting of people killed by police since 2013. Please see the [website](https://mappingpoliceviolence.org/) for more information. 

If you’re curious about the data, it’s well worth reading the descriptions of the data collection methods, and following links to other projects that Mapping Police Violence is built on. Following these links will also lead you to a variety of projects with similar aims.

In [None]:
df = pd.read_csv('mapping_police_violence.csv')
df.head()

In [None]:
df.columns

There are a lot of columns / features here that are worthy of exploration, and I will leave it up to you to decide if you want to engage with this. 