# A Cholera Outbreak in London

## The Outbreak

In the late summer of 1854, a particularly bad outbreak of cholera struck the Soho neighborhood in central London. Between August 31 and September 10, over 500 people died. By the end of the outbreak, the death toll was 616.

In this notebook, we will test two proposed explanations (hypotheses) of how cholera spread in London: through the **air** and through the **water**. That is, we will check whether the data support one hypothesis more strongly than the other in a ***statistically significant*** way.

<br>

<table><tr>
    <td> <img src="../imgs/king_cholera.png" alt="Drawing" style="width: 600px;"/> </td>
</tr></table>

<br>

## The Theories

Two predominant theories on the cause of cholera existed at the time:

### Airborne

<img align="right" width="300" height="300" src='../imgs/airborne.png'> 

Inhalation of a poison given off by dead or contaminated organic matter like sewage which enters the body through the lungs and poisons the blood. 

<img align="right" width="300" height="300" src='../imgs/waterborne.png'>

### Waterborne

Ingestion of “excretions of the sick” which contain a living organism which infects the gastrointestinal system.

<img align="right" width="300" height="300" src='../imgs/john_snow.jpg'>

## The Doctor

* John Snow
* Known for pioneering anesthesia techniques
* Noticed that cholera affected the gastrointestinal system
* Hypothesized that contaminated drinking water was the cause of cholera
* Most people disagreed with him

<img align="right" height="300" width="500" src = '../imgs/soho_map.jpeg'>

## The Map

* John Snow collected data for each household, including the number of deaths from cholera, during the 1854 outbreak.
* Each death is represented as a black line.
* Multiple deaths in the same household appear stacked.

## The Pump

<img align="right" width="400" src = '../imgs/broad_street.png'>

* John Snow centered his map on a particular water pump that he suspected to be the source of the outbreak.
* The pump was located on Broad Street.


## The Sewers

<img align="right" width="400" src = '../imgs/broad_sewers.png'>

* However, the neighborhood also had many sewers.
* Sewers were thought to be a source of cholera by many supporters of the airborne theory.
* Sewers are represented as squares in the map to the right.




## The Data

People in charge of the city’s sewers went door-to-door in the Soho neighborhood to assess the claim that toxic fumes from its sewers were causing the deaths. We have digitized this data into a .csv file that has the following columns: 

- **house_ID:** unique identifier for the house
- **deaths:** the total deaths in that particular house 
- **dis_sewers:** distance (in meters) from the house to the nearest sewer (1 meter = 3.3 feet)
- **dis_bspump:** distance (in meters) from the house to the Broad St. pump

# Our Own Data Experiment

In the case of the airborne and waterborne theories, we can separate people into groups. The exposed group (people living near a sewer or the water pump) is often called an **impact** or **treatment group** while the unexposed group (people living far from a sewer or the water pump) is the **control group**. When testing the airborne theory, we will group people based on whether they lived near a sewer or not and whether they died of cholera or not. When testing the waterborne theory, we will group people based on whether they lived near a certain water pump or not and whether they died of cholera or not. 

This will result in four groups for each proposed explanation. We will place them in a 2x2 **contingency table** (also called a ***two-way table*** or ***crosstab***). We will have to test each explanation separately. In all, that means four contingency tables: an expected (null) and an observed table for each of the two hypotheses.

# The Airborne Hypothesis: Investigating the Sewers

Now that we've talked about how to set up our experiment, let's apply this to the cholera data! 

The first theory we will explore assumes that cholera is airborne and that people get infected by inhaling toxic fumes from localized sources. In this case, the source is fumes emitted from sewers. 

<center><img src = '../imgs/sewer.jpeg' width=400></center>

If this theory were true, then closer proximity to sewers would make it more likely to inhale the toxic air and contract cholera. For simplicity, let us assume **someone is 'close' to a sewer if they are less than 40 feet (12.2 meters) from one** ... otherwise they are 'far'. Unfortunately, we don't have the total number of people in each house. That data was not collected. Therefore, we will have to count houses instead of people.

A contingency table simply shows how often each combination of two variables occurs, with one variable appearing on each axis. It technically does not matter, but a common approach is to put the independent (explanatory) variable on the x-axis (the columns) and the dependent (outcome) variable on the y-axis (the rows). While there are libraries to create contingency tables for us, we will build some ourselves in order to better understand them. Here is the contingency table for the airborne theory:

<img src="../imgs/sewers_observed.png" style="width: 600px;"/>

### Predict


Even though we will test the airborne theory by assuming the null hypothesis is true, if there was support for the alternative hypothesis (i.e., there is an association between proximity to a sewer and cholera), what do you predict the observed contingency table will look like? In other words, which of the four groups would have an unusually high number of counts?

> Write your answer here! 

### Building the Observed Contingency Table

We will now build a contingency table for what was actually observed during the outbreak using the following variable names. 

- observed number of houses near a sewer with a death from cholera: `obs_near_sewer_deaths`
- observed number of houses near a sewer without a death from cholera: `obs_near_sewer_nondeaths`
- observed number of houses far from a sewer with a death from cholera: `obs_far_sewer_deaths`
- observed number of houses far from a sewer without a death from cholera: `obs_far_sewer_nondeaths`

**Calculate the number of houses that were near a sewer and had people that died of cholera.**

**Calculate the number of houses that were near a sewer that did not have a death from cholera.**

**Calculate the number of houses that were far from a sewer and had people that died from cholera.**

**Calculate the number of houses that were far from a sewer that did not have deaths from cholera.**
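One possible way to do the four counts above is sketched below. It is only an illustration under stated assumptions: the rows in `toy_houses` are made up (they are not taken from the survey), the helper name `count_sewer_groups` is hypothetical, and the keys simply mirror the column names listed in "The Data". With the real .csv file you would first read the rows and convert the values to numbers.

```python
# Toy rows for illustration only: these values are made up, not survey data.
toy_houses = [
    {"house_ID": 1, "deaths": 2, "dis_sewers": 5.0, "dis_bspump": 60.0},
    {"house_ID": 2, "deaths": 0, "dis_sewers": 8.5, "dis_bspump": 210.0},
    {"house_ID": 3, "deaths": 1, "dis_sewers": 30.0, "dis_bspump": 95.0},
    {"house_ID": 4, "deaths": 0, "dis_sewers": 45.0, "dis_bspump": 300.0},
    {"house_ID": 5, "deaths": 0, "dis_sewers": 11.9, "dis_bspump": 150.0},
]

NEAR_SEWER_M = 12.2  # 'close' means less than 40 feet (12.2 meters)


def count_sewer_groups(houses, threshold_m=NEAR_SEWER_M):
    """Return the four observed house counts as a dict (hypothetical helper)."""
    counts = {"near_deaths": 0, "near_nondeaths": 0,
              "far_deaths": 0, "far_nondeaths": 0}
    for house in houses:
        # Strictly less than the threshold counts as 'near'.
        place = "near" if house["dis_sewers"] < threshold_m else "far"
        # A house counts as a 'death' house if it had at least one death.
        outcome = "deaths" if house["deaths"] > 0 else "nondeaths"
        counts[f"{place}_{outcome}"] += 1
    return counts


toy_counts = count_sewer_groups(toy_houses)
print(toy_counts)
```

Note that the sketch counts houses, not people: a house with two deaths still adds only one to its group.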

**Summarize the numbers.**

In [None]:
print(f"Observed houses near a sewer with a death: {obs_near_sewer_deaths}")
print(f"Observed houses near a sewer without a death: {obs_near_sewer_nondeaths}")
print(f"Observed houses far from a sewer with a death: {obs_far_sewer_deaths}")
print(f"Observed houses far from a sewer without a death: {obs_far_sewer_nondeaths}")

**The following code defines a helper function that prints a contingency table with labeled columns and rows. Run the cell.**

In [None]:
def visualize_contingency_table(contingency_table, top_labels, left_labels):
    # Header row: a blank corner cell followed by the two column labels.
    print('{:<15s} {:<15s} {:<20s}'.format(top_labels[0], top_labels[1], top_labels[2]))

    # One printed line per table row. The 'g' format handles both the integer
    # observed counts and the fractional expected counts.
    for label, row in zip(left_labels, contingency_table):
        print('{:<15s} {:<15.5g} {:<20.5g}'.format(label, row[0], row[1]))
    
top_labels = [" ", "Near Sewer", "Far from Sewer"]
left_labels = ["Deaths", "Non Deaths", "Total"]


**This cell will insert your calculated values for the observed data in the contingency table. Run the cell.**

In [None]:
obs_contingency_table = [
    [obs_near_sewer_deaths, obs_far_sewer_deaths],
    [obs_near_sewer_nondeaths, obs_far_sewer_nondeaths]
] 

print("Observed Contingency Table:")
visualize_contingency_table(obs_contingency_table, top_labels, left_labels)

### Building the Null Contingency Table

Testing the airborne hypothesis requires us to compare what was observed (the contingency table we just constructed) with what we would expect if there was **no relationship** between the sewers and cholera (the null hypothesis). 

In order to do this, we need to construct another contingency table that shows the expected counts under the null hypothesis.  

We can do this by using the row and column totals (the yellow boxes) from the previous contingency table to calculate the 2x2 table (the green boxes). 

<center><img src="../imgs/sewer_contingency.jpeg" style="width: 800px;"/></center>

### Calculating Totals

Calculate the row and column totals and store the values in the following variables:

- total houses that had a death from cholera: `total_deaths`
- total houses that did not have a death from cholera: `total_nondeaths`
- total houses that were near a sewer: `total_near_sewer`
- total houses that were far from a sewer: `total_far_sewer`
- total number of houses: `total_houses`
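The arithmetic behind these totals can be sketched as follows. The four counts are made-up numbers chosen only to keep the sums easy to follow, and the `toy_` names are hypothetical so they do not overwrite the variables you define for the real data.

```python
# Made-up observed counts, for illustration only.
toy_near_deaths, toy_far_deaths = 30, 10
toy_near_nondeaths, toy_far_nondeaths = 20, 40

# Row totals: add across each row of the table.
toy_total_deaths = toy_near_deaths + toy_far_deaths
toy_total_nondeaths = toy_near_nondeaths + toy_far_nondeaths

# Column totals: add down each column of the table.
toy_total_near = toy_near_deaths + toy_near_nondeaths
toy_total_far = toy_far_deaths + toy_far_nondeaths

# Grand total: every house belongs to exactly one of the four groups.
toy_total_houses = toy_total_deaths + toy_total_nondeaths

print(toy_total_deaths, toy_total_nondeaths, toy_total_near, toy_total_far, toy_total_houses)
```

A quick sanity check is that the row totals and the column totals both add up to the same grand total.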

**Summarize the data:**

In [None]:
print(f"Number of houses with a death: {total_deaths}")
print(f"Number of houses without a death: {total_nondeaths}")
print(f"Number of houses near a sewer: {total_near_sewer}")
print(f"Number of houses far from a sewer: {total_far_sewer}")
print(f"Total number of houses: {total_houses}")

### Calculating the Expected Counts

We can use the row and column totals to calculate the expected counts using the following equation:

$$\text{expected value} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$


**Calculate the expected counts in the contingency table and store the values in the following variables:**

- expected number of houses near a sewer with a death from cholera: `exp_near_sewer_deaths`
- expected number of houses near a sewer without a death from cholera: `exp_near_sewer_nondeaths`
- expected number of houses far from a sewer with a death from cholera: `exp_far_sewer_deaths`
- expected number of houses far from a sewer without a death from cholera: `exp_far_sewer_nondeaths`
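As a minimal sketch of the equation above, the hypothetical helper `expected_counts` below applies "row total times column total over grand total" to every cell of a table. The input table is made up for illustration and uses the same layout as the observed table in this notebook (rows are deaths and non-deaths, columns are near and far).

```python
def expected_counts(observed):
    """Expected counts under the null hypothesis for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: (its row total * its column total) / grand total.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Made-up observed table, for illustration only.
toy_observed = [[30, 10],
                [20, 40]]
toy_expected = expected_counts(toy_observed)
print(toy_expected)  # [[20.0, 20.0], [30.0, 30.0]]
```

The expected counts are usually not whole numbers, and their row and column totals match those of the observed table.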

In [None]:
print(f"Expected houses near a sewer with a death: {exp_near_sewer_deaths}")
print(f"Expected houses near a sewer without a death: {exp_near_sewer_nondeaths}")
print(f"Expected houses far from a sewer with a death: {exp_far_sewer_deaths}")
print(f"Expected houses far from a sewer without a death: {exp_far_sewer_nondeaths}")

**Run the following code to insert our data in the table and display it.**

In [None]:
exp_contingency_table = [
    [exp_near_sewer_deaths, exp_far_sewer_deaths],
    [exp_near_sewer_nondeaths, exp_far_sewer_nondeaths]
] 

print("Expected (Null) Contingency Table:")
visualize_contingency_table(exp_contingency_table, top_labels, left_labels)

**Does there appear to be a significant difference between the expected numbers under the null hypothesis and the observed numbers?** 

> Write your answer here! 

### Calculating the p-value

Even if there is a difference between the expected and observed contingency tables, is it large enough to reject the null hypothesis and accept the alternative that living close to a sewer is associated with higher cholera rates? The method for testing statistical significance in contingency tables is called a "chi-squared ($\chi^2$) analysis". 

There is a library called "SciPy" that has a function that will do the chi-squared analysis for us.

In [None]:
from scipy.stats import chi2_contingency

Even though we have calculated the expected values manually, this process has been built into the chi-squared function in SciPy. Therefore, we only have to pass the observed contingency table to the function.

The `chi2_contingency` function returns 4 values: the test statistic, the p-value, the degrees of freedom, and the table of expected counts. We are only interested in the p-value. When doing data science in Python, a common convention is to use `_` for values we don't need.

In [None]:
_, p_value, _, _ = chi2_contingency(obs_contingency_table)
print(f"p-value: {p_value:.4g}")  # 4 significant digits, so a very small p-value does not print as 0.00
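To see roughly what happens inside the test, here is a hedged, pure-Python sketch of the plain Pearson chi-squared statistic for a 2x2 table. The helper name `pearson_chi2_2x2` and the table are made up for illustration. One caveat: for 2x2 tables SciPy applies Yates' continuity correction by default, so this uncorrected version should correspond to calling `chi2_contingency(table, correction=False)`, while the default call gives a slightly larger p-value.

```python
import math


def pearson_chi2_2x2(observed):
    """Uncorrected Pearson chi-squared statistic and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            # Each cell contributes (observed - expected)^2 / expected.
            chi2 += (obs - exp) ** 2 / exp

    # A 2x2 table has 1 degree of freedom, for which
    # P(chi-squared > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up observed table, for illustration only.
toy_chi2, toy_p = pearson_chi2_2x2([[30, 10], [20, 40]])
print(f"chi2 = {toy_chi2:.3f}, p-value = {toy_p:.2g}")
```

The bigger the gaps between observed and expected counts, the larger the statistic and the smaller the p-value.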

**Based on the p-value of your $\chi^2$ test, can you reject the null hypothesis (there is no association between living close to a sewer and dying from cholera) and accept the alternative (there is an association between living close to a sewer and dying from cholera)?**

> Write your answer here! 

# The Waterborne Hypothesis: Investigating the Broad Street Pump

Next, we want to explore the theory that cholera was transmitted through contaminated water. At the time, John Snow guessed that the water of a particular pump, the Broad Street Pump (BSP, for short), might have carried pieces of poisonous sewage. Did the data support this hypothesis? 

<center><img src="../imgs/pump3.jpeg" alt="Drawing" style="width: 300px;"/></center>

If this theory were true, then closer proximity to the Broad Street Pump would make it more likely to drink its contaminated water and contract cholera. For simplicity, let us assume **someone is 'close' to the Broad Street Pump if they are at most 460 feet (140 meters) from it**... otherwise they are 'far'.



Here is the contingency table for the waterborne theory with totals along the bottom and right side:

<img src="../imgs/pump_contingency.jpeg" style="width: 800px;"/>

**Using the same approach as you applied to test the airborne (sewer) hypothesis, conduct a test of the waterborne hypothesis by constructing the observed and expected (null) contingency tables and then conducting a $\chi^2$ test.**
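One detail that differs from the sewer case is the threshold: 'close' to the pump means *at most* 140 meters, so a house exactly 140 meters away counts as near. A minimal sketch of building the observed table with that rule is shown below. The helper name `observed_pump_table` is hypothetical and the rows are made up for illustration; the resulting table uses the same layout as `obs_contingency_table`.

```python
NEAR_PUMP_M = 140  # 'close' means at most 460 feet (140 meters)


def observed_pump_table(houses, threshold_m=NEAR_PUMP_M):
    """Build [[near deaths, far deaths], [near nondeaths, far nondeaths]]."""
    table = [[0, 0], [0, 0]]
    for house in houses:
        row = 0 if house["deaths"] > 0 else 1  # deaths row or nondeaths row
        # '<=' because a house exactly at the threshold still counts as near.
        col = 0 if house["dis_bspump"] <= threshold_m else 1
        table[row][col] += 1
    return table


# Toy rows for illustration only: these values are made up, not survey data.
toy_pump_houses = [
    {"deaths": 2, "dis_bspump": 60.0},
    {"deaths": 0, "dis_bspump": 210.0},
    {"deaths": 1, "dis_bspump": 95.0},
    {"deaths": 0, "dis_bspump": 300.0},
    {"deaths": 0, "dis_bspump": 140.0},
    {"deaths": 3, "dis_bspump": 180.0},
]
toy_pump_table = observed_pump_table(toy_pump_houses)
print(toy_pump_table)  # [[2, 1], [1, 2]]
```

A table built this way from the real data can be passed to `chi2_contingency` exactly as in the sewer analysis.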

**Based on the p-value of your $\chi^2$ test, can you reject the null hypothesis (there is no association between living close to the Broad Street Pump and dying from cholera) and accept the alternative (there is an association between living close to the Broad Street Pump and dying from cholera)?**

> Write your answer here! 

# Visualizing the Data

<img align='right' src="../imgs/funny_paper.jpeg" alt="Drawing" style="width: 400px;"/>

An important part of data science is not only determining statistical significance of hypotheses, but also communicating your findings to people without a statistics background. 

Imagine reading a newspaper headline (like below) that says ’The p-value was below 0.05’... the average person does not know what this means! Visualizing your results is an important step in convincing others that your evidence is compelling! In the following, we create (and interpret) data visualizations that make it easier to understand your statistical results.

We first explore a **bar chart** (often loosely called a histogram): a graph whose bar heights show differences in the frequency (or count) of various events. (In this case, the events are houses with and without deaths, close to and far from the sewer.) 

In [None]:
# Bar chart of observed percentages (near vs. far from a sewer)
import matplotlib.pyplot as plt

# Percentage of houses with a death that are 'near' versus 'far'.
# The two percentages sum to 100
# (then we do the same for houses without a death).
obs_near_sewer_deaths_pct = obs_near_sewer_deaths / (obs_near_sewer_deaths + obs_far_sewer_deaths) * 100
obs_far_sewer_deaths_pct = 100 - obs_near_sewer_deaths_pct

obs_near_sewer_nondeaths_pct = obs_near_sewer_nondeaths / (obs_near_sewer_nondeaths + obs_far_sewer_nondeaths) * 100
obs_far_sewer_nondeaths_pct = 100 - obs_near_sewer_nondeaths_pct

# 1. Houses with a death: near vs. far.
plt.bar(x=['Deaths Near', 'Deaths Far'], 
        height=[obs_near_sewer_deaths_pct, obs_far_sewer_deaths_pct], color='purple', label='Deaths')

# 2. Houses without a death: near vs. far.
plt.bar(x=['Nondeaths Near', 'Nondeaths Far'], 
        height=[obs_near_sewer_nondeaths_pct, obs_far_sewer_nondeaths_pct], color='gold', label='Nondeaths')
plt.ylim((0,100))
plt.ylabel("Percentage of Deaths or Nondeaths")
plt.title("Deaths and Nondeaths (Close and Far from Sewer)")
plt.legend()
plt.show()

**Now we can do the same for comparing deaths and nondeaths of people near and far from the Broad St. Pump.**

In [None]:
obs_near_pump_deaths_pct = obs_near_pump_deaths / (obs_near_pump_deaths + obs_far_pump_deaths) * 100
obs_far_pump_deaths_pct = 100 - obs_near_pump_deaths_pct

obs_near_pump_nondeaths_pct = obs_near_pump_nondeaths / (obs_near_pump_nondeaths + obs_far_pump_nondeaths) * 100
obs_far_pump_nondeaths_pct = 100 - obs_near_pump_nondeaths_pct
 
plt.bar(x=['Deaths Near', 'Deaths Far'], 
        height=[obs_near_pump_deaths_pct, obs_far_pump_deaths_pct], color='purple', label='Deaths')

plt.bar(x=['Nondeaths Near', 'Nondeaths Far'], 
        height=[obs_near_pump_nondeaths_pct, obs_far_pump_nondeaths_pct], color='gold', label='Nondeaths')
plt.ylim((0,100))
plt.ylabel("Percentage of Deaths or Nondeaths")
plt.title("Deaths and Nondeaths (Close and Far from Broad Street Pump)")
plt.legend()

### The 3 Second Rule

The 3 Second Rule states that you get 3 seconds to grab someone's attention and flag the take-home point of a data visualization: https://stephanieevergreen.com/the-3-second-rule/

**What do these charts communicate to you? Do they follow the 3 Second Rule?** 


> Write your answer here! 