# A Cholera Outbreak in London

## The Outbreak

In the late summer of 1854, a particularly bad outbreak of cholera struck the Soho neighborhood in central London. Between August 31 and September 10, over 500 people died. By the end of the outbreak, the death toll was 616.

In this notebook, we will test two proposed explanations (hypotheses) of how cholera spread in London: through the **air** and through the **water**. That is, we will show that some hypotheses are likely a better fit for the data and are harder to reject, in a ***statistically significant*** way, than others. 

<br>

<table><tr>
    <td> <img src="https://github.com/jdomyancich/big-data-camp/blob/main/imgs/king_cholera.png?raw=true" alt="Drawing" style="width: 600px;"/> </td>
</tr></table>

<br>

## The Theories

Two predominant theories on the cause of cholera existed at the time:

### Airborne

<img align="right" width="300" height="300" src='https://github.com/jdomyancich/big-data-camp/blob/main/imgs/airborne.png?raw=true'> 

Inhalation of a poison given off by dead or contaminated organic matter, such as sewage. The poison enters the body through the lungs and poisons the blood.

<img align="right" width="300" height="300" src='https://github.com/jdomyancich/big-data-camp/blob/main/imgs/waterborne.png?raw=true'>

### Waterborne

Ingestion of “excretions of the sick” containing a living organism that infects the gastrointestinal system.

<img align="right" width="300" height="300" src='https://github.com/jdomyancich/big-data-camp/blob/main/imgs/john_snow.jpg?raw=true'>

## The Doctor

* John Snow
* Known for pioneering anesthesia techniques
* Noticed that cholera affected the gastrointestinal system
* Hypothesized that contaminated drinking water was the cause of cholera
* Most people disagreed with him

<img align="right" height="300" width="500" src = 'https://github.com/jdomyancich/big-data-camp/blob/main/imgs/soho_map.jpeg?raw=true'>

## The Map

* John Snow collected data for each household, including the number of deaths from cholera, during the 1854 outbreak.
* Each death is represented as a black line.
* Multiple deaths in the same household appear stacked.

## The Pump

<img align="right" width="400" src = 'https://github.com/jdomyancich/big-data-camp/blob/main/imgs/broad_street.png?raw=true'>

* John Snow centered his map on a particular water pump that he suspected to be the source of the outbreak.
* The pump was located on Broad Street.


## The Sewers

<img align="right" width="400" src = 'https://github.com/jdomyancich/big-data-camp/blob/main/imgs/broad_sewers.png?raw=true'>

* However, the neighborhood also had many sewers.
* Sewers were thought to be a source of cholera by many supporters of the airborne theory.
* Sewers are represented as squares in the map to the right.




## The Data

People in charge of the city’s sewers went door-to-door in the Soho neighborhood to assess the claim that toxic fumes from its sewers were causing the deaths. We have digitized this data into a .csv file that has the following columns: 

- **house_ID:** unique identifier for the house
- **deaths:** the total deaths in that particular house 
- **dis_sewers:** distance (in meters) from the house to the nearest sewer (1 meter = 3.3 feet)
- **dis_bspump:** distance (in meters) from the house to the Broad St. pump

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

house_data = pd.read_csv('https://raw.githubusercontent.com/jdomyancich/big-data-camp/refs/heads/main/data/deaths_by_house.csv')
house_data.head()

Unnamed: 0,house_ID,deaths,dis_sewers,dis_bspump
0,1,0,10.08,125.0
1,2,1,14.64,119.94
2,3,0,18.47,116.27
3,4,0,22.98,112.56
4,5,0,27.47,109.1


# Our Own Data Experiment

In the case of the airborne and waterborne theories, we can separate people into groups. The exposed group (people living near a sewer or the water pump) is often called an **impact** or **treatment group** while the unexposed group (people living far from a sewer or the water pump) is the **control group**. When testing the airborne theory, we will group people based on whether they lived near a sewer or not and whether they died of cholera or not. When testing the waterborne theory, we will group people based on whether they lived near a certain water pump or not and whether they died of cholera or not. 

This will result in four groups for each proposed explanation. We will place them in a 2x2 **contingency table** (also called a ***two-way table*** or ***crosstab***). We will have to test each explanation separately. In all, that means four contingency tables: an expected (null) and an observed table for each of the two hypotheses.
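
As a rough illustration of what an expected (null) table looks like, the sketch below computes the counts we would expect if exposure and outcome were unrelated: each cell is its row total times its column total, divided by the grand total. The function name `expected_counts` and the numbers are hypothetical, made up for this example, and are not taken from the cholera data.

```python
def expected_counts(observed):
    """Return the expected 2x2 (or larger) table under 'no association'."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected cell = row total * column total / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts for illustration only (NOT from the cholera data).
# Rows: outcome (died, survived); columns: exposure (far, near).
observed = [[30, 10],
            [70, 90]]

expected = expected_counts(observed)
print(expected)  # [[20.0, 20.0], [80.0, 80.0]]
```

The further the observed table sits from the expected one, the harder it is to explain the difference by chance alone.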

# The Airborne Hypothesis: Investigating the Sewers

Now that we've talked about how to set up our experiment, let's apply this to the cholera data! 

The first theory we will explore assumes that cholera is airborne and that people get infected by inhaling toxic fumes from localized sources. In this case, the source is fumes emitted from sewers. 

<center><img src = 'https://github.com/jdomyancich/big-data-camp/blob/main/imgs/sewer.jpeg?raw=true' width=400></center>

If this theory were true, then living closer to a sewer would make it more likely to inhale the toxic air and contract cholera. For simplicity, let us assume **someone is 'close' to a sewer if they are at most 40 feet (12.2 meters) from one** ... otherwise they are 'far'. Unfortunately, we don't have the total number of people in each house. That data was not collected. Therefore, we will have to count houses instead of people.

A contingency table shows how often each combination of two variables occurs, with one variable on each axis. It technically does not matter which variable goes where, but a common approach is to put the independent (explanatory) variable on the x-axis (the columns) and the dependent (outcome) variable on the y-axis (the rows). While there are libraries that create contingency tables for us, we will build some ourselves in order to better understand them. Here is the contingency table for the airborne theory:

<img src="https://github.com/jdomyancich/big-data-camp/blob/main/imgs/sewers_observed.png?raw=true" style="width: 600px;"/>
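
The counting behind such a table can be sketched in plain Python. This is a minimal sketch, assuming the 12.2 meter cutoff from above and using only the first five houses shown in the data preview; the labels and the `counts` variable are illustrative choices, not the notebook's official solution.

```python
from collections import Counter

# (dis_sewers in meters, deaths) for house_ID 1-5, copied from the preview above
houses = [(10.08, 0), (14.64, 1), (18.47, 0), (22.98, 0), (27.47, 0)]

CUTOFF_M = 12.2  # 40 feet, the 'close to a sewer' threshold

counts = Counter()
for dis_sewers, deaths in houses:
    proximity = "Near Sewer" if dis_sewers <= CUTOFF_M else "Far from Sewer"
    status = "Deaths" if deaths > 0 else "Non-Deaths"
    counts[(status, proximity)] += 1  # one tally per house, not per person

for status in ("Deaths", "Non-Deaths"):
    print(status, counts[(status, "Far from Sewer")], counts[(status, "Near Sewer")])
```

Each house lands in exactly one of the four cells, so the four counts always add up to the number of houses.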

### Building the Observed Contingency Table

We will now build a contingency table for what was actually observed during the outbreak.

#### Using Conditional Logic: `np.where()`

This allows you to create values based on conditions. For example, we can categorize houses as near or far from a sewer and create a new column with this information. We will use 12.2 meters as the cutoff for the condition:

In [23]:
###
house_data['sewer_proximity'] = np.where(house_data['dis_sewers'] > 12.2, "Far from Sewer", "Near Sewer")
house_data.head()
###

Unnamed: 0,house_ID,deaths,dis_sewers,dis_bspump,sewer_proximity
0,1,0,10.08,125.0,Near Sewer
1,2,1,14.64,119.94,Far from Sewer
2,3,0,18.47,116.27,Far from Sewer
3,4,0,22.98,112.56,Far from Sewer
4,5,0,27.47,109.1,Far from Sewer
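
To see what `np.where` does element by element, here is a small sketch that uses the five `dis_sewers` values from the preview above instead of the full dataset. The array name `distances` is an assumption made for this example.

```python
import numpy as np

# dis_sewers (meters) for house_ID 1-5, copied from the preview above
distances = np.array([10.08, 14.64, 18.47, 22.98, 27.47])

# np.where(condition, value_if_true, value_if_false) checks every element:
# where the condition holds it picks the first label, otherwise the second.
labels = np.where(distances > 12.2, "Far from Sewer", "Near Sewer")

for distance, label in zip(distances, labels):
    print(distance, label)
```

Because the condition is `> 12.2`, a house at exactly 12.2 meters would be labeled "Near Sewer".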


We can use `np.where` to label whether a house had a death or not.

In [53]:
###
house_data['death_status'] = np.where(house_data['deaths'] == 0, "Non-Deaths", "Deaths")
house_data.head()
###

Unnamed: 0,house_ID,deaths,dis_sewers,dis_bspump,sewer_proximity,death_status
0,1,0,10.08,125.0,Near Sewer,Non-Deaths
1,2,1,14.64,119.94,Far from Sewer,Deaths
2,3,0,18.47,116.27,Far from Sewer,Non-Deaths
3,4,0,22.98,112.56,Far from Sewer,Non-Deaths
4,5,0,27.47,109.1,Far from Sewer,Non-Deaths


Now we have the data (the four categories) to build our contingency table. A contingency table is also known as a **two-way table** or **crosstabulation**. Fortunately, Pandas has a function called `crosstab()` that will construct the table for us:

In [54]:
###
sewer_table = pd.crosstab(house_data["death_status"], house_data["sewer_proximity"])
sewer_table
###

sewer_proximity,Far from Sewer,Near Sewer
death_status,Unnamed: 1_level_1,Unnamed: 2_level_1
Deaths,252,117
Non-Deaths,1047,436


**Does there appear to be a significant difference in the incidence of deaths between houses that are near a sewer vs. far from a sewer?** 

We can get a better sense of the difference between the groups by calculating the `Deaths` and `Non-Deaths` as percentages using the `normalize` argument in the `crosstab` function.

In [55]:
###
norm_sewer_table = pd.crosstab(house_data["death_status"], house_data["sewer_proximity"], normalize='columns') * 100
norm_sewer_table
###

sewer_proximity,Far from Sewer,Near Sewer
death_status,Unnamed: 1_level_1,Unnamed: 2_level_1
Deaths,19.399538,21.157324
Non-Deaths,80.600462,78.842676
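
Normalizing by columns just divides each count by its column total. The sketch below reproduces the percentages by hand with NumPy, typing in the counts from the observed table above; the variable names are illustrative.

```python
import numpy as np

# Observed house counts from the crosstab above.
# Rows: Deaths, Non-Deaths; columns: Far from Sewer, Near Sewer.
counts = np.array([[252, 117],
                   [1047, 436]])

column_totals = counts.sum(axis=0)           # houses far from / near a sewer
percentages = counts / column_totals * 100   # each column now sums to 100

print(column_totals)
print(percentages.round(2))
```

Comparing percentages rather than raw counts matters here because there are more than twice as many houses far from a sewer as near one.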


### Calculating the p-value

Even if there is a difference between the two groups of houses, is it large enough to support the claim that living close to a sewer is associated with higher cholera rates, rather than being a difference caused by randomness?

The method for testing statistical significance in contingency tables is called a "chi-squared ($\chi^2$) analysis". 

There is a library called "SciPy" that has a function that will do the chi-squared analysis for us.

In [51]:
from scipy.stats import chi2_contingency

The `chi2_contingency` function returns 4 values. We are only interested in the p-value. When doing data science in Python, it is a common convention to use `_` characters to mark variables whose values we don't need.

In [52]:
###
_, p_value, _, _ = chi2_contingency(sewer_table)
print(f"p-value: {p_value:.2f}")
###

p-value: 0.42
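
To get a feel for what `chi2_contingency` is doing, the sketch below recomputes the p-value by hand from the observed counts. It assumes SciPy's default behavior for a 2x2 table, which applies Yates' continuity correction (shrinking each observed-minus-expected gap by 0.5), and it uses the fact that with one degree of freedom the chi-squared tail probability equals `erfc(sqrt(chi2 / 2))`. Variable names are illustrative.

```python
import math

# Observed house counts from the sewer crosstab above.
# Rows: Deaths, Non-Deaths; columns: Far from Sewer, Near Sewer.
observed = [[252, 117],
            [1047, 436]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if proximity and deaths were unrelated
        expected = row_totals[i] * col_totals[j] / n
        # Yates' correction: shrink the gap by 0.5, never below zero
        gap = max(abs(obs - expected) - 0.5, 0.0)
        chi2 += gap ** 2 / expected

# Tail probability of a chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"p-value: {p_value:.2f}")
```

Rounded to two decimals this gives 0.42, matching the SciPy output above.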


**Based on the p-value of your $\chi^2$ test, what are the chances that the higher death rate in houses near a sewer is due to random chance?**

# The Waterborne Hypothesis: Investigating the Broad Street Pump

Next, we want to explore the theory that cholera was transmitted through contaminated water. At the time, John Snow guessed that the water of a particular pump, the Broad Street Pump (BSP, for short), might have carried pieces of poisonous sewage. Did the data support this hypothesis? 

<center><img src="https://github.com/jdomyancich/big-data-camp/blob/main/imgs/pump3.jpeg?raw=true" alt="Drawing" style="width: 300px;"/></center>

If this theory were true, then living closer to the Broad Street Pump would make it more likely to drink its contaminated water and contract cholera. For simplicity, let us assume **someone is 'close' to the Broad Street Pump if they are at most 140 meters from it**... otherwise they are 'far'.


<img src="https://github.com/jdomyancich/big-data-camp/blob/main/imgs/pumps_observed.png?raw=true" style="width: 600px;"/>

In [56]:
house_data['pump_proximity'] = np.where(house_data['dis_bspump'] <= 140, "Near the Pump", "Far from the Pump")
house_data.head()

Unnamed: 0,house_ID,deaths,dis_sewers,dis_bspump,sewer_proximity,death_status,pump_proximity
0,1,0,10.08,125.0,Near Sewer,Non-Deaths,Near the Pump
1,2,1,14.64,119.94,Far from Sewer,Deaths,Near the Pump
2,3,0,18.47,116.27,Far from Sewer,Non-Deaths,Near the Pump
3,4,0,22.98,112.56,Far from Sewer,Non-Deaths,Near the Pump
4,5,0,27.47,109.1,Far from Sewer,Non-Deaths,Near the Pump


**Print the contingency table.**

In [57]:
pump_table = pd.crosstab(house_data["death_status"], house_data["pump_proximity"])
pump_table

pump_proximity,Far from the Pump,Near the Pump
death_status,Unnamed: 1_level_1,Unnamed: 2_level_1
Deaths,162,207
Non-Deaths,1284,199


In [58]:
norm_pump_table = pd.crosstab(house_data["death_status"], house_data["pump_proximity"], normalize='columns') * 100
norm_pump_table

pump_proximity,Far from the Pump,Near the Pump
death_status,Unnamed: 1_level_1,Unnamed: 2_level_1
Deaths,11.20332,50.985222
Non-Deaths,88.79668,49.014778


### Is it significant?

In [60]:
# Run the test on the raw counts in pump_table, not on the percentages
_, p_value, _, _ = chi2_contingency(pump_table)
print(f"p-value: {p_value:.10f}")

p-value: 0.0000000000

The p-value is far smaller than ten decimal places can show, so it prints as zero.


**Based on the p-value of your $\chi^2$ test, what are the chances there is no association between living close to the Broad Street Pump and dying from cholera?**

> Write your answer here! 