
# Part 1: Seeing the Problem as Data


### Data from the 1854 London cholera outbreak

Cholera is a disease that we now know how to treat, but before that treatment was discovered, only 1 in 2 infected people survived. In mid-19th century London, doctors, health inspectors, pastors, and many others desperately tried to find the cause of cholera to stop people from dying. One of them was **John Snow**, a physician who started recording the following information about London households:

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/img-ledger.png" alt="Ledger" align=left style="width: 500px;"/> 

<!-- As you can see, there are data about the following (in order):
- House number
- Neighborhood
- Occupation
- Age
- Symptoms
- Water Supplier -->

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4>**Journal 1b:** Taking inventory: what do we have?</font>

**Describe the data you see in John Snow's notes (the photo). What types of data are present? Are there other types of data you would want to collect if you were trying to find out how cholera is transmitted?** 

> - location
> - daily activities (occupation)
> - demographics (age)
> - presenting concerns (symptoms)
> - possible source of infection (water supplier)

**Now we will embark on our own journey to trace the origins of the cholera epidemic, this time using a 21st-century data science toolkit!**
<br><br>
In the next cell, please add your information to get started! 
<br><br>

In [1]:
# Change this to be your name! 
first_name = "Andrew"
last_name = "Chang-DeWitt"

# print is the simplest Python function -- it allows us to view our data! 
print(f">>> Hello world, my name is {first_name} {last_name}!")

>>> Hello world, my name is Andrew Chang-DeWitt!


## 1.2: Representing Data on the Computer

In this **Jupyter notebook**, we will begin our data science journey via the **Python programming language**. 

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/python-image.png" alt="Drawing" style="width: 400px;"/>


**By the end of this notebook, you should be able to**: 
- Represent data from the real world in Python
- See how data are stored, grouped, and organized for data science
- Understand and manipulate variables and lists
<br><br>

**Data representation in Python**
The following contains some of the data from above represented in Python code. All of the *{}* and *:* symbols aside, this should map somehow to John Snow's notes shown above. 

In [3]:
person_0 = {"house_number": 7, 
            "neighborhood": "Layton's Buildings",
            "date": "July 29", 
            "occupation": "tailor",
            "age": 20, 
            "symptoms": "cholera 17 hours",
            "water_supplier": "Southwark & Vauxhall"}

person_1 = {"house_number": 2, 
            "neighborhood": "Dobb's Cross",
            "date": "July 30", 
            "occupation": "son of a shop-keeper",
            "age": 10, 
            "symptoms": "cholera Asiatic 24 hours",
            "water_supplier": "Southwark & Vauxhall"}

person_2 = {"house_number": 81, 
            "neighborhood": "Ann Street",
            "date": "July 29", 
            "occupation": "son of a labourer",
            "age": 12, 
            "symptoms": "cholera 8 hours",
            "water_supplier": "Southwark & Vauxhall"}

person_3 = {"house_number": 12, 
            "neighborhood": "Layton's Buildings",
            "date": "July 31", 
            "occupation": "tailor",
            "age": 20, 
            "symptoms": "cholera 17 hours",
            "water_supplier": "Southwark & Vauxhall"}

# We can 'group' all of these data together by using a 'list'!
people_list = [person_0, person_1, person_2, person_3]
print(f"Our list of people: {people_list}")


Our list of people: [{'house_number': 7, 'neighborhood': "Layton's Buildings", 'date': 'July 29', 'occupation': 'tailor', 'age': 20, 'symptoms': 'cholera 17 hours', 'water_supplier': 'Southwark & Vauxhall'}, {'house_number': 2, 'neighborhood': "Dobb's Cross", 'date': 'July 30', 'occupation': 'son of a shop-keeper', 'age': 10, 'symptoms': 'cholera Asiatic 24 hours', 'water_supplier': 'Southwark & Vauxhall'}, {'house_number': 81, 'neighborhood': 'Ann Street', 'date': 'July 29', 'occupation': 'son of a labourer', 'age': 12, 'symptoms': 'cholera 8 hours', 'water_supplier': 'Southwark & Vauxhall'}, {'house_number': 12, 'neighborhood': "Layton's Buildings", 'date': 'July 31', 'occupation': 'tailor', 'age': 20, 'symptoms': 'cholera 17 hours', 'water_supplier': 'Southwark & Vauxhall'}]


<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 1c:** The data whisperer...</font>

**First, add `person_3` to `people_list` in the above cell (and re-run the cell). Is there anything unusual about how 'counting' is done in Python?** 

> Write your answer here! 

It starts from 0.
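
To make the "counting from 0" idea concrete, here is a minimal sketch of how individual records and fields can be pulled back out of the list. The two dictionaries are shortened copies of the records above (fewer fields, purely for brevity), and the names `first_person`, `second_person`, and `first_age` are just illustrative.

```python
# Shortened copies of two records from the cell above (fewer fields, for brevity).
person_0 = {"house_number": 7, "occupation": "tailor", "age": 20}
person_1 = {"house_number": 2, "occupation": "son of a shop-keeper", "age": 10}
people_list = [person_0, person_1]

# Positions (indexes) in a list start at 0, so index 0 is the first item.
first_person = people_list[0]
second_person = people_list[1]

# A dictionary is looked up by key name rather than by position.
first_age = first_person["age"]

print(f"First person's age: {first_age}")
print(f"Second person's occupation: {second_person['occupation']}")
print(f"Number of people in the list: {len(people_list)}")
```

Note that the last valid index is always one less than `len(people_list)`, because counting starts at 0.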


## 1.3: Let's *slow down* and smell the... data types?

So now we need to ask... ***what was actually going on in the previous code?***

In this section, we are going to see if we can understand the basic units of Python data storage: 
- types
- variables
- lists
- lists of lists


--------------
There are several **types** in Python that can be used to represent data values. I outline a handful of types in the following:


| Type | Description | Examples |
| :-- | :-- | :-- |
| String ('str') | A sequence of characters; stored within "" | "cat", "London", "27" |
| Integer ('int')| Whole positive or negative numbers *without* decimal points | -50, 0, 27 |
| Float ('float') | Numbers written with a decimal point, with any number of digits after it | -50.0, 0.75, 3.14159 |
| List ('list') | A sequence of any Python data type; stored within \[ \] | \["a", "b", "c"\] |

<br><br>
We can see the type of data in Python by using the `type()` function as shown in the following cell: 

In [3]:
# Types
print(type("London"))
print(type(37))
print(type("37"))
print(type(37.0))
print(type(['a', 'b', 'c']))

<class 'str'>
<class 'int'>
<class 'str'>
<class 'float'>
<class 'list'>
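
The output above shows that `37` and `"37"` have different types even though they look alike. As a small sketch of why that matters, the built-in `int()` and `float()` functions can turn number-like strings into numbers. The variable names here are only for illustration.

```python
# Quotes make a value a string, even when it looks like a number.
text_number = "37"
print(type(text_number))   # <class 'str'>

# int() and float() build a number from a string that looks like one.
as_int = int(text_number)
as_float = float("0.5")
print(type(as_int), type(as_float))

# Strings and numbers behave differently under "+".
print(text_number + text_number)  # joins the text: 3737
print(as_int + as_int)            # adds the numbers: 74
```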


<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 1d:** Learning about types!</font>

**In the following cell, write some example lines that answer the following questions:**

c1. **What is the type of a string that uses '' instead of ""?**
> `<class 'str'>`

c2. **What is the type of "0.5"?**
> `<class 'str'>` (the quotes make it a string; without the quotes, `type(0.5)` gives `<class 'float'>`)

c3. **What happens if we turn "London" into London (no quotes) and try to run the cell?**
> ```
> ---------------------------------------------------------------------------
> NameError                                 Traceback (most recent call last)
> Cell In[4], line 3
>       1 print(type("test"))
>       2 print(type(0.5))
> ----> 3 London
> 
> NameError: name 'London' is not defined
> ```

A **variable** is a named storage container in Python that holds a value of a **single type** at any given time. For instance, in algebra class, you might be familiar with math equations like the following:

$2x=50$

... where **$x$** is an **integer variable** that represents some value ***(which is???)***. 

In Python, we can set variables to be almost anything that we want. Earlier in this notebook, for instance, we used two string variables to represent your first and last names:
```
first_name = "John"
last_name = "Snow"
```
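
Here is a short sketch that ties the algebra example to Python variables. It assumes nothing beyond built-in Python; the names `x`, `full_name`, and `age` are chosen only for illustration.

```python
# Solving 2x = 50 by hand means dividing both sides by 2.
x = 50 // 2          # "//" is whole-number division, so x stays an integer
print(x, type(x))

# Variables can be combined to build new values.
first_name = "John"
last_name = "Snow"
full_name = first_name + " " + last_name
print(full_name)

# A variable keeps its value until we assign a new one.
age = 20
age = age + 1        # read the old value, add 1, store the result back
print(age)
```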

## 1.4: Expanding our Data Science Toolbox with 'lists of lists'. 

As mentioned before, a **list** in Python is a sequence of other Python types, such as `[1, 2, 3, 4, 5]` or `['a', 'b', 'c']`. Interestingly, you can also create a list of lists in Python to create **tabular** data, or data in row-column format. 

We can very easily rewrite our data into a 'list of lists' format. See as follows...


In [4]:

headers = ["house_number", "neighborhood", "date", "occupation", "age", "symptoms", "water_supplier"]
person_list_0 =  [7, "Layton's Buildings", "July 29", "tailor", 
                  20, "cholera 17 hours", "Southwark & Vauxhall"]

person_list_1 = [2, "Dobb's Cross", "July 30", "son of a shop-keeper", 
                 10, "cholera Asiatic 24 hours", "Southwark & Vauxhall"]

person_list_2 = [81, "Ann Street", "July 29", "son of a labourer",
                 12, "cholera 8 hours", "Southwark & Vauxhall"]

person_list_3 = [12, "Layton's Buildings", "July 31", "tailor", 
                 20, "cholera 17 hours", "Southwark & Vauxhall"]

person_2d_list = [person_list_0, person_list_1, person_list_2, person_list_3]
print(f"This is our list of lists: {person_2d_list}")

This is our list of lists: [[7, "Layton's Buildings", 'July 29', 'tailor', 20, 'cholera 17 hours', 'Southwark & Vauxhall'], [2, "Dobb's Cross", 'July 30', 'son of a shop-keeper', 10, 'cholera Asiatic 24 hours', 'Southwark & Vauxhall'], [81, 'Ann Street', 'July 29', 'son of a labourer', 12, 'cholera 8 hours', 'Southwark & Vauxhall'], [12, "Layton's Buildings", 'July 31', 'tailor', 20, 'cholera 17 hours', 'Southwark & Vauxhall']]
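
In a list of lists, a single value is reached with two indexes: the first picks the row and the second picks the column. The sketch below reuses two of the rows from above. The helper names `age_column` and `first_person_as_dict` are hypothetical.

```python
headers = ["house_number", "neighborhood", "date", "occupation",
           "age", "symptoms", "water_supplier"]
person_list_0 = [7, "Layton's Buildings", "July 29", "tailor",
                 20, "cholera 17 hours", "Southwark & Vauxhall"]
person_list_1 = [2, "Dobb's Cross", "July 30", "son of a shop-keeper",
                 10, "cholera Asiatic 24 hours", "Southwark & Vauxhall"]
person_2d_list = [person_list_0, person_list_1]

# Find the position of "age" in the header list (remember: counting starts at 0).
age_column = headers.index("age")

# First index picks the row, second index picks the column.
second_person_age = person_2d_list[1][age_column]
print(f"'age' is column {age_column}; the second person's age is {second_person_age}")

# zip() pairs each header with the value in the same position,
# which rebuilds the dictionary form used earlier in the notebook.
first_person_as_dict = dict(zip(headers, person_list_0))
print(first_person_as_dict["neighborhood"])
```

The headers live in a separate list here, so the rows only make sense if every row keeps its values in the same order as `headers`.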


In order to store, explore, and manipulate these **tabular** data, we use **Pandas** --- one of the most well-known data science toolkits for Python. 

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pandas.svg.png" alt="Drawing" style="width: 400px;"/>

**We will use lots of Pandas in this class!** For now, let's load the Pandas library into Python and put our list-of-lists into a Pandas **Dataframe**. 


In [6]:
import pandas as pd

df = pd.DataFrame.from_records(person_2d_list, columns=headers)
df

Unnamed: 0,house_number,neighborhood,date,occupation,age,symptoms,water_supplier
0,7,Layton's Buildings,July 29,tailor,20,cholera 17 hours,Southwark & Vauxhall
1,2,Dobb's Cross,July 30,son of a shop-keeper,10,cholera Asiatic 24 hours,Southwark & Vauxhall
2,81,Ann Street,July 29,son of a labourer,12,cholera 8 hours,Southwark & Vauxhall
3,12,Layton's Buildings,July 31,tailor,20,cholera 17 hours,Southwark & Vauxhall
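
As an aside, the list of dictionaries from section 1.2 can also be turned into a dataframe. This is a minimal sketch that passes a shortened version of `people_list` (three fields only, for brevity) straight to `pd.DataFrame`, which uses the dictionary keys as column names. The name `df_from_dicts` is just illustrative.

```python
import pandas as pd

# Shortened records: the keys become the column names.
people_list = [
    {"house_number": 7, "neighborhood": "Layton's Buildings", "age": 20},
    {"house_number": 2, "neighborhood": "Dobb's Cross", "age": 10},
]

df_from_dicts = pd.DataFrame(people_list)
print(df_from_dicts)
print(df_from_dicts.shape)          # (rows, columns)
print(list(df_from_dicts.columns))  # column names taken from the keys
```

With dictionaries there is no separate `headers` list to keep in sync, which is one reason this form can be convenient for small hand-entered data.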


# Part 2: Exploring the Data I


## 2.1: Seeing the Problem in Data

Finding the origins of cholera's spread was a contentious issue in the 1800s. Before even seeing the problem in the data, the people of that time could see cholera **all around them**. As friends and relatives grew gravely ill, it became urgent to discover **why** this was happening in hopes of putting a stop to it. 

People, including the local media, had different ideas about what could be causing cholera. For instance, take a look at the following political cartoon of the time: 
<br>

<table><tr>
    <td> <img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/king_cholera.png" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>

<br>

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2a:** Interpret the Cartoon </font>

**What is the underlying message of this cartoon?** 

> Write your answer here! 

The cholera epidemic was coming from unclean conditions on the streets.

-------------------------------------------------------------------------------------------------

### Data Science 101: Finding the Problem

Data scientists have curious minds; when confronted with a problem in the 'real world' they first try to better understand that problem with data (before, of course, looking for answers and solutions). 

After all, it's a lot easier for superheroes to solve a mystery when there's a signal illuminating what is driving the problem. 

<table><tr>
    <td> <img src="imgs/bat-signal.jpeg" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>

<br><br>

Using the Pandas toolkit, let's see if we can dig into our data to gain any insight as to ***when*** cholera outbreaks have occurred. The data below show, for a given year, how many people lived in London and how many people died there (of any cause).

In [8]:
# Load data about London
London = pd.read_csv("London.csv", index_col='Year')
London

Unnamed: 0_level_0,Population,Deaths
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1840,1842458,46281
1841,1877963,45284
1842,1916860,45272
1843,1953787,48574
1844,2033816,50423
1845,2073298,48332
1846,2113535,49089
1847,2195401,60442
1848,2238703,57628
1849,2282858,68432


<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2b:** Thinking about the Data </font>

**In 1-3 sentences, comment on what you see in these data. Is there enough here to determine when cholera outbreaks occurred? Why or why not?** 

> Write your answer here!

I see a relatively constant death total across most of the years presented. There are spikes, however, in 1847-1849, then again in 1853-1854. While it is clear something happened during those times to increase the death total beyond typical ranges, it is not clear what the cause was from this data alone.

### The importance of normalization 

'Deaths' are higher in 1854 than in 1840. Is this because of cholera? *Maybe*. Is this because of population growth? *Possibly*. Simply put, if you have more people in a city, then more people die. 

For instance, if `40,000` people die in Chicago each year (pop. 3,000,000), and only `50` people die in the small farm town of Lonsdale, MN each year (pop. 4,000) ... then are you `40,000 / 50 = 800` times less likely to die in Lonsdale?!


<table><tr>
    <td> <img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/chicago.jpeg" alt="Drawing" style="width: 400px;"/> </td>
    <td> <img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/lonsdale.jpeg" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>

This example highlights the importance of ***normalization***: adjusting the values of data so that they are on the **same scale**  – in this case, so you can compare the chance of dying in the much larger city of Chicago vs. the much smaller town of Lonsdale. 


<br><br>
<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2c**: How to Normalize
    
**For the London data, how might you normalize to find years when cholera is particularly fatal?**</font>

> Write your answer here!

Calculate a death rate, normalizing the total deaths against the total population.

<br><br>

## 2.2: Creating an Outcome Variable
An **outcome variable** is the variable that we want to explain using other variables! You can **normalize an outcome variable** to avoid the population-level pitfalls discussed earlier. 

Let's return to the London example... 

Since we are interested in reasons why people die of cholera, `Deaths` seems like a logical choice for our outcome variable! 

BUT different years have different populations, which leads us to the "Chicago-vs-Lonsdale" dilemma from before... 

We can easily normalize `Deaths` by using the `Population` variable to create a `Death Rate`, also known as "Mortality Rate".

This is done with the following calculation:
$$death \ rate_{1000} = {deaths \over population} \times 1000$$
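
As a quick check of the formula, here is a sketch that applies it to the Chicago and Lonsdale numbers from earlier. Those figures were illustrative "what if" values, not real statistics, and the variable names are hypothetical.

```python
# Illustrative figures from the example above (not real statistics).
chicago_deaths, chicago_pop = 40_000, 3_000_000
lonsdale_deaths, lonsdale_pop = 50, 4_000

# death rate per 1000 = deaths / population * 1000
chicago_rate = chicago_deaths / chicago_pop * 1000
lonsdale_rate = lonsdale_deaths / lonsdale_pop * 1000

print(f"Chicago:  {chicago_rate:.1f} deaths per 1000 people")
print(f"Lonsdale: {lonsdale_rate:.1f} deaths per 1000 people")
```

Once both places are on the same "per 1000" scale, the two rates come out close to each other, even though the raw death counts differ by a factor of 800.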

**>>> Exercise:** complete the following cell: 

In [10]:
# Calculate the mortality rate per 1000 people.
# The "/" divides each value in the "Deaths" column by the matching value (same row) in the "Population" column.
# The "*" then multiplies every result (our new outcome variable) by 1000.
London['Deaths per 1000'] = London["Deaths"] / London["Population"] * 1000
London

Unnamed: 0_level_0,Population,Deaths,Deaths per 1000
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1840,1842458,46281,25.119161
1841,1877963,45284,24.113361
1842,1916860,45272,23.617792
1843,1953787,48574,24.861461
1844,2033816,50423,24.792312
1845,2073298,48332,23.311651
1846,2113535,49089,23.226017
1847,2195401,60442,27.531189
1848,2238703,57628,25.741691
1849,2282858,68432,29.976459
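
Scanning a column by eye works for a short table, but pandas can also rank the rows. The sketch below rebuilds only the ten rows displayed above (1840 to 1849) by hand, so it does not depend on `London.csv`, then sorts by the new column. The names `ranked` and `worst_year` are illustrative.

```python
import pandas as pd

# Only the ten rows displayed above, typed in by hand.
London = pd.DataFrame(
    {"Population": [1842458, 1877963, 1916860, 1953787, 2033816,
                    2073298, 2113535, 2195401, 2238703, 2282858],
     "Deaths": [46281, 45284, 45272, 48574, 50423,
                48332, 49089, 60442, 57628, 68432]},
    index=pd.Index(range(1840, 1850), name="Year"),
)
London["Deaths per 1000"] = London["Deaths"] / London["Population"] * 1000

# Sort so the years with the highest death rate come first.
ranked = London.sort_values("Deaths per 1000", ascending=False)
print(ranked.head(3))

# idxmax() returns the index label (here, the year) of the largest value.
worst_year = London["Deaths per 1000"].idxmax()
print(f"Highest death rate among these ten rows: {worst_year}")
```

A high overall death rate only hints at an outbreak; these data count deaths from any cause, so the ranking is a starting point rather than proof.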


<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2d**: The magic of 'per'
    
**Explain in your own words why `Deaths per 1000` is a better outcome variable than `Deaths`.**</font>

> Write your answer here! 

It accounts for changes in total population that could affect the death total.

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2e**: Identifying the Outbreak(s)
    
**In which years, based on your new outcome variable, were there cholera outbreaks?**</font>

> Write your answer here!

Possibly in 1847, 1849, & 1854.

----------------------------------------------

## Part 3: Pandas Dataframe Manipulations

-------------------------

In the previous Pandas dataframe, we created a new column by combining the other columns in some meaningful way. In this section, we will practice dataframe manipulations. To begin, let's look at some data about that cholera outbreak from 1849. 

19th-century London was divided into districts, much like Chicago is divided into neighborhoods. These districts were grouped by geography, just like Chicago (South Side, North Side, West Side, Far South Side, etc.).

<br>

<img src="https://i2.wp.com/londontopia.net/wp-content/uploads/2014/08/london-county.jpg" width=600>

<br>

In [14]:
# Load district-level data about the 1849 cholera outbreak
outbreak = pd.read_csv("https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/JohnSnowDataSets/The_Outbreak_of_1849.csv")
outbreak

Unnamed: 0,District,Region,Population (1851),Area (Acres),Elevation,"Houses, Inhabited",Average value of house,Deaths from Cholera (1849)
0,Bermondsey,South,48128,688,0,5674,18,734
1,Bethnal Green,East,90193,760,38,11782,9,789
2,Camberwell,South,54667,4342,4,6843,25,504
3,Chelsea,West,56538,865,12,5648,29,247
4,City of London,Central,55932,434,31,7921,117,207
5,Clerkenwell,Central,64778,315,65,6946,33,121
6,East London,Central,44406,153,40,4796,38,182
7,Greenwich,South,99305,5367,8,11995,22,718
8,Hackney,North,58429,3929,53,7192,25,139
9,Hampstead,North,11986,2252,350,1411,40,9


### Action 1. Column statistics. 
-------------------------------------
As data scientists, we often want to know *something* about entire columns of data. 

Pandas provides a number of utilities to compute column **aggregates**, as follows: 

- **min**: find the minimum value(s) of a column. 
- **max**: find the maximum value(s) of a column. 
- **mean**: compute the average of the column. 
- **sum**: add up elements in a column. 

In [15]:
deaths_min = outbreak['Deaths from Cholera (1849)'].min()
deaths_max = outbreak['Deaths from Cholera (1849)'].max()
deaths_mean = outbreak['Deaths from Cholera (1849)'].mean()
deaths_sum = outbreak['Deaths from Cholera (1849)'].sum()

print(f"deaths min: {deaths_min}\ndeaths max: {deaths_max}\ndeaths mean: {deaths_mean}\ndeaths sum: {deaths_sum}")

deaths min: 9
deaths max: 1618
deaths mean: 392.3333333333333
deaths sum: 14124
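
The four aggregates can also be requested in one call. This sketch uses `Series.agg` with a list of function names on just the ten death counts displayed in the table above, so its results differ from the full-dataset numbers printed by the cell. The name `summary` is illustrative.

```python
import pandas as pd

# Only the first ten districts shown above, not the full dataset.
deaths = pd.Series([734, 789, 504, 247, 207, 121, 182, 718, 139, 9],
                   name="Deaths from Cholera (1849)")

# agg() with a list of names computes each aggregate and labels the results.
summary = deaths.agg(["min", "max", "mean", "sum"])
print(summary)
```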


### Action 2. Creating a new column. 
------------------------------------------
Let's say that you want to create a new column in Pandas. This can be done by setting a column to a value by using the `df["column_name"] = ...` notation. Let's create a new column where we calculate the number of people per house using the following calculation: 

$$people \ per \ house = {population \over houses_{inhabited}}$$

In [16]:
# A. Create a new column that calculates the number of people per house. 
outbreak['People per House'] = outbreak['Population (1851)']/outbreak['Houses, Inhabited']
outbreak

Unnamed: 0,District,Region,Population (1851),Area (Acres),Elevation,"Houses, Inhabited",Average value of house,Deaths from Cholera (1849),People per House
0,Bermondsey,South,48128,688,0,5674,18,734,8.4822
1,Bethnal Green,East,90193,760,38,11782,9,789,7.655152
2,Camberwell,South,54667,4342,4,6843,25,504,7.988748
3,Chelsea,West,56538,865,12,5648,29,247,10.010269
4,City of London,Central,55932,434,31,7921,117,207,7.06123
5,Clerkenwell,Central,64778,315,65,6946,33,121,9.325943
6,East London,Central,44406,153,40,4796,38,182,9.258966
7,Greenwich,South,99305,5367,8,11995,22,718,8.278866
8,Hackney,North,58429,3929,53,7192,25,139,8.124166
9,Hampstead,North,11986,2252,350,1411,40,9,8.494685


### Action 3. Selecting columns.
-----------------------------------
Let's say that you want to 'select' only certain dataframe columns. You can select just one column using `df["column_name"]` or multiple columns as follows `df[["column_name_1", "column_name_2"]]`. See the following...

In [17]:
# Select only the district column. 
new_df = outbreak["District"]
new_df

0                    Bermondsey
1                 Bethnal Green
2                    Camberwell
3                       Chelsea
4                City of London
5                   Clerkenwell
6                   East London
7                     Greenwich
8                       Hackney
9                     Hampstead
10                      Holborn
11                    Islington
12                   Kensington
13                      Lambeth
14                     Lewisham
15                   Marylebone
16                    Newington
17                      Pancras
18                       Poplar
19                  Rotherhithe
20                   Shoreditch
21    St. George Hanover Square
22         St. George Southwark
23       St. George-in-the-East
24                    St. Giles
25        St. James Westminster
26                     St. Luke
27     St. Martin-in-the-Fields
28         St. Olave, Southwark
29       St. Saviour, Southwark
30                      Stepney
31      

In [18]:
# Select the "District" and "Region" columns. Notice the double brackets because we are putting a list in outbreak[...]
new_df = outbreak[["District", "Region"]]
new_df

Unnamed: 0,District,Region
0,Bermondsey,South
1,Bethnal Green,East
2,Camberwell,South
3,Chelsea,West
4,City of London,Central
5,Clerkenwell,Central
6,East London,Central
7,Greenwich,South
8,Hackney,North
9,Hampstead,North
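
One detail worth seeing directly: single brackets return a one-dimensional pandas `Series`, while double brackets return a `DataFrame`, even when the list holds only one column name. The sketch below uses a three-row stand-in for `outbreak` built by hand from values in the table above.

```python
import pandas as pd

# A small stand-in for the outbreak dataframe (three rows, three columns).
outbreak = pd.DataFrame({
    "District": ["Bermondsey", "Bethnal Green", "Camberwell"],
    "Region": ["South", "East", "South"],
    "Elevation": [0, 38, 4],
})

one_column = outbreak["District"]            # single brackets -> Series
same_column_as_df = outbreak[["District"]]   # list with one name -> DataFrame
two_columns = outbreak[["District", "Region"]]

print(type(one_column).__name__)         # Series
print(type(same_column_as_df).__name__)  # DataFrame
print(two_columns.shape)                 # (3, 2)
```

This explains why the single-column output earlier looked different from the two-column table.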


### Action 4. Selecting rows.
-----------------------------------
Selecting certain rows is a different process because rows in a dataframe can contain different data types. When we filter rows, we just want to see the rows that contain a certain value or range of values.

We use what is called a "boolean" expression: a true/false test that is checked for every row. It is coded like this: `df[(df["column_name"] == "some_value")]`

The double equals sign `==` means "is it equal to?", as opposed to the single `=`, which means "**make** it equal to" (this is how we set a variable equal to a certain value).

For example, let's say that we only want to see the row for East London...

In [19]:
new_row = outbreak[(outbreak["District"] == "East London")]
new_row

Unnamed: 0,District,Region,Population (1851),Area (Acres),Elevation,"Houses, Inhabited",Average value of house,Deaths from Cholera (1849),People per House
6,East London,Central,44406,153,40,4796,38,182,9.258966


You can also use other **"operators"** like: 
- `>` greater than
- `<` less than
- `<=` less than or equal to 
- `>=` greater than or equal to
- `!=` not equal to

If we wanted to select the districts that are below an elevation of 20:

In [20]:
low_elev = outbreak[(outbreak["Elevation"] < 20)]
low_elev

Unnamed: 0,District,Region,Population (1851),Area (Acres),Elevation,"Houses, Inhabited",Average value of house,Deaths from Cholera (1849),People per House
0,Bermondsey,South,48128,688,0,5674,18,734,8.4822
2,Camberwell,South,54667,4342,4,6843,25,504,7.988748
3,Chelsea,West,56538,865,12,5648,29,247,10.010269
7,Greenwich,South,99305,5367,8,11995,22,718,8.278866
13,Lambeth,South,139325,4015,3,17791,28,1618,7.831207
16,Newington,South,64816,624,-1,9370,22,907,6.917396
18,Poplar,East,47162,2918,8,5066,44,313,9.309514
19,Rotherhithe,South,17805,886,0,2420,23,352,7.357438
22,St. George Southwark,South,51824,282,0,6663,22,836,7.777878
28,"St. Olave, Southwark",South,19375,169,4,2523,35,349,7.67935


### Applying Dataframe Manipulations to the Problem

One of the first things we can do when exploring data for insights is to see if there are **spatial patterns** to our outcome variable. In other words, does location affect the outcome?

Let's apply what we just learned about dataframe manipulations to see if cholera's impact was spatial.

Perform the following dataframe manipulation exercises! Fill in the "???" with the proper Python code!

**1. Print out a version of `outbreak` containing only the district, region, population and deaths columns. Call this dataframe `outbreak_spatial`.**

In [21]:
# Put your answer here!
outbreak_spatial = outbreak[["District", "Region", "Population (1851)", "Deaths from Cholera (1849)"]]
outbreak_spatial

Unnamed: 0,District,Region,Population (1851),Deaths from Cholera (1849)
0,Bermondsey,South,48128,734
1,Bethnal Green,East,90193,789
2,Camberwell,South,54667,504
3,Chelsea,West,56538,247
4,City of London,Central,55932,207
5,Clerkenwell,Central,64778,121
6,East London,Central,44406,182
7,Greenwich,South,99305,718
8,Hackney,North,58429,139
9,Hampstead,North,11986,9


**2. Create a column called "Deaths per 1000" that is the mortality rate.  *Hint: see outcome variable*.**

In [22]:
# Put your answer here! 
outbreak_spatial["Deaths per 1000"] = \
    outbreak_spatial["Deaths from Cholera (1849)"] \
    / outbreak_spatial["Population (1851)"] \
    * 1000
outbreak_spatial

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  outbreak_spatial["Deaths per 1000"] = \


Unnamed: 0,District,Region,Population (1851),Deaths from Cholera (1849),Deaths per 1000
0,Bermondsey,South,48128,734,15.250997
1,Bethnal Green,East,90193,789,8.747907
2,Camberwell,South,54667,504,9.219456
3,Chelsea,West,56538,247,4.368743
4,City of London,Central,55932,207,3.700923
5,Clerkenwell,Central,64778,121,1.867918
6,East London,Central,44406,182,4.098545
7,Greenwich,South,99305,718,7.23025
8,Hackney,North,58429,139,2.378956
9,Hampstead,North,11986,9,0.750876
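
The warning printed above the table appears because `outbreak_spatial` was selected out of `outbreak`, and pandas cannot be sure whether adding a column should affect only the selection or the original as well. The column is still created, as the table shows. One common way to avoid the warning, sketched here on a three-row stand-in for `outbreak`, is to call `.copy()` when making the selection so the new dataframe is clearly independent.

```python
import pandas as pd

# A small stand-in for the outbreak dataframe (values from the table above).
outbreak = pd.DataFrame({
    "District": ["Bermondsey", "Bethnal Green", "Camberwell"],
    "Region": ["South", "East", "South"],
    "Population (1851)": [48128, 90193, 54667],
    "Elevation": [0, 38, 4],
    "Deaths from Cholera (1849)": [734, 789, 504],
})

# .copy() makes outbreak_spatial its own dataframe rather than a slice of outbreak.
outbreak_spatial = outbreak[["District", "Region", "Population (1851)",
                             "Deaths from Cholera (1849)"]].copy()

# Parentheses let the calculation span several lines without backslashes.
outbreak_spatial["Deaths per 1000"] = (
    outbreak_spatial["Deaths from Cholera (1849)"]
    / outbreak_spatial["Population (1851)"]
    * 1000
)
print(outbreak_spatial.round(2))

# The original dataframe is left untouched.
print("Deaths per 1000" in outbreak.columns)  # False
```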


**3. Select only the rows for the North districts and put them in a new dataframe called `outbreak_north`.**

In [25]:
outbreak_north = outbreak_spatial[(outbreak_spatial["Region"] == "North")]
outbreak_north

Unnamed: 0,District,Region,Population (1851),Deaths from Cholera (1849),Deaths per 1000
8,Hackney,North,58429,139,2.378956
9,Hampstead,North,11986,9,0.750876
11,Islington,North,95329,187,1.961628
15,Marylebone,North,157696,261,1.655083
17,Pancras,North,166956,360,2.156257


**4. Repeat Step 3 for the other four regions.** Make a new dataframe for each: `outbreak_south`, `outbreak_east`, etc.

In [26]:
outbreak_south = outbreak_spatial[(outbreak_spatial["Region"] == "South")]
outbreak_south

Unnamed: 0,District,Region,Population (1851),Deaths from Cholera (1849),Deaths per 1000
0,Bermondsey,South,48128,734,15.250997
2,Camberwell,South,54667,504,9.219456
7,Greenwich,South,99305,718,7.23025
13,Lambeth,South,139325,1618,11.613135
14,Lewisham,South,34835,96,2.755849
16,Newington,South,64816,907,13.993458
19,Rotherhithe,South,17805,352,19.769728
22,St. George Southwark,South,51824,836,16.131522
28,"St. Olave, Southwark",South,19375,349,18.012903
29,"St. Saviour, Southwark",South,35731,539,15.08494


In [27]:
outbreak_west = outbreak_spatial[(outbreak_spatial["Region"] == "West")]
outbreak_west

Unnamed: 0,District,Region,Population (1851),Deaths from Cholera (1849),Deaths per 1000
3,Chelsea,West,56538,247,4.368743
12,Kensington,West,120004,206,1.716609
21,St. George Hanover Square,West,73230,131,1.788884
25,St. James Westminster,West,36406,57,1.565676
27,St. Martin-in-the-Fields,West,24640,91,3.693182
34,Westminster,West,65609,437,6.660672


In [28]:
outbreak_east = outbreak_spatial[(outbreak_spatial["Region"] == "East")]
outbreak_east

Unnamed: 0,District,Region,Population (1851),Deaths from Cholera (1849),Deaths per 1000
1,Bethnal Green,East,90193,789,8.747907
18,Poplar,East,47162,313,6.636699
20,Shoreditch,East,109257,830,7.596767
23,St. George-in-the-East,East,48376,199,4.11361
30,Stepney,East,110775,501,4.522681
35,Whitechapel,East,79759,506,6.344112


**5. Calculate the average (mean) death rate for each region.**

In [30]:
death_rate_south = outbreak_south["Deaths per 1000"].mean()
print(f"South: {death_rate_south}")

South: 12.59968677572774


In [None]:
death_rate_north = outbreak_???[???].???
death_rate_south = outbreak_???[???].???
death_rate_east = outbreak_???[???].???
death_rate_west = outbreak_???[???].???
death_rate_central = outbreak_???[???].???

print(f"North: {death_rate_north}\nSouth: {death_rate_south}\nEast: {death_rate_east}\nWest: {death_rate_west}\nCentral: {death_rate_central}")
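
The pattern for one region is "filter the rows, pick the column, take the mean". The sketch below shows it on a hand-typed stand-in containing only the North and West districts listed in the tables above. It then shows `groupby`, a different pandas technique that computes the mean for every region in one step. The names `cholera` and `by_region` are hypothetical.

```python
import pandas as pd

# Stand-in data: only the North and West districts shown above.
cholera = pd.DataFrame({
    "District": ["Hackney", "Hampstead", "Islington", "Marylebone", "Pancras",
                 "Chelsea", "Kensington", "St. George Hanover Square",
                 "St. James Westminster", "St. Martin-in-the-Fields", "Westminster"],
    "Region": ["North"] * 5 + ["West"] * 6,
    "Population (1851)": [58429, 11986, 95329, 157696, 166956,
                          56538, 120004, 73230, 36406, 24640, 65609],
    "Deaths from Cholera (1849)": [139, 9, 187, 261, 360,
                                   247, 206, 131, 57, 91, 437],
})
cholera["Deaths per 1000"] = (
    cholera["Deaths from Cholera (1849)"] / cholera["Population (1851)"] * 1000
)

# Pattern from the exercise: filter rows, select the column, take the mean.
outbreak_north = cholera[cholera["Region"] == "North"]
death_rate_north = outbreak_north["Deaths per 1000"].mean()
print(f"North: {death_rate_north:.2f}")

# Alternative technique: groupby splits the rows by region and averages each group.
by_region = cholera.groupby("Region")["Deaths per 1000"].mean()
print(by_region.round(2))
```

Both approaches average the district-level rates, so a small district counts as much as a large one. Dividing a region's total deaths by its total population would give a somewhat different number.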

To help visualize the spatial patterns, here is a map of death rates by boroughs, which are similar to districts:

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/1849_map.png" align=left style="width: 400px;"/>


### 2.3 Reflection

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 2f**: Is it spatial?
    
**Based on the death rates for the different regions, do you think that cholera had spatial patterns?** If so, which region(s) of London were most impacted?</font>

> Write your answer here! 

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Last step: Save and Submit your work!** </font>