# MCB 50: Immunity & Disease

**Estimated Time**: 30-40 minutes <br>
**Databook created by**: Rucha Kelkar, Harry Li, Elias Saravia

Today we will examine a dataset (i.e., a table) and a few graphs on MRSA bacterial infections. This notebook will challenge you to analyze data and use your knowledge of immunity and disease to form conclusions about MRSA. The notebook will also serve as a gentle introduction to Jupyter Notebooks (the platform you are currently using) and Python (a coding language).

### Table of Contents
1. [Intro to Jupyter](#0) <br>
1. [Intro to Python](#1) <br>
1. [Intro to MRSA](#2) <br>
1. [Intro to Data](#3) <br>
1. [Comparing Infection Rates Over Time](#4) <br>
1. [Comparing County Infection Rates with County Populations](#5) <br>
1. [Bonus Challenge](#6)
1. [Submit Your Work](#7)

# 1. Intro to Jupyter <a id='0'></a>

This webpage is a Jupyter Notebook. Jupyter Notebooks provide an interactive interface for running code, and this one uses the Python coding language. We will use this notebook to analyze a dataset on MRSA bloodstream infections in California hospitals. Jupyter Notebooks are composed of both regular text cells and code cells. Code cells have a gray background. To run a code cell, click the cell and press Shift + Enter while it is selected, or hit the ▶| Run button in the toolbar at the top. You can also save your work using the button in the top left-hand corner.

An example of a code cell is shown below. The contents of this cell set up the notebook by importing pre-written Python packages that we will be using to read in, clean, analyze, and model our data. Run it, and once it is done, you should see "Done!" printed underneath the cell. 

In [20]:
## DO NOT DELETE ANYTHING IN THIS CELL ##

# import utilities
import numpy as np
import pandas as pd
from datascience import *

print("Done!")

Done!


### 1.1 Types of Cells

#### Text Cells
Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called Markdown to add formatting and section headings. You don't need to learn Markdown, but know the difference between Text Cells and Code Cells.

#### Code Cells 
Other cells contain code in the Python 3 language. Don't worry -- you are not expected to write your own code for this notebook. Be sure to review, however, how to **run** a code cell! 

#### Running Cells
"Running a cell" is equivalent to pressing "Enter" on a calculator once you've typed in the expression you want to evaluate: it produces an output. When you run a text cell, it outputs clean, organized writing. When you run a code cell, it computes all of the expressions you want to evaluate, and can output the result of the computation.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, you can either press the <code><b> ▶|</b> Run </code> button above or press <b><code>Shift + Return</code></b> or <b><code>Shift + Enter</code></b>. This will run the current cell and select the next one.


*For this class, you are not expected to write any code. You will be looking at plots generated by pre-written widgets. Your job is to write up your discussions and observations in the provided text cells.*

#### How to save your work
Click on the leftmost icon in the tool bar (left of the plus icon).
Alternatively, you can hit Ctrl+S on a PC or Command+S on a Mac.

### 1.2 Common errors and how to fix them

##### Accidentally deleted something in a cell? 
Double click on that cell and press Ctrl+Z or Command+Z until you recover the deleted information. Otherwise, ask your GSI for help. 

##### Getting a really long error message? 
This could be a result of deleting code somewhere in the notebook in a code cell. If you remember which cell you deleted code from, double click on that cell and press Ctrl+Z or Command+Z until the code is as it was originally. Otherwise, raise your hand and ask for help from your GSI. 

##### Getting the error 'data_frame' not present? 
Make sure you have all of the relevant data files in your datahub home. Download the .csv files. Upload them into datahub in the folder where you see this notebook is stored. 

##### Something is running for too long? 
Try restarting the kernel. Sometimes Jupyter gets overloaded, even by simple commands. Don't worry: save your work, then restart the kernel from the Kernel menu.

### 1.3 Where to get help - Peer consultants

This class and the Division of Computing, Data Science, and Society have many resources to help you get the most out of this assignment. First, you can ask your GSI or the peer consultants walking around this lab section for help if you run into any issues. The Division also holds walk-in [data science office hours](https://data.berkeley.edu/academics/resources/peer-consulting) on the 3rd floor of Moffitt, where you can drop by and ask for help with your notebook. 

---

# 2. Intro to Python <a id='1'></a>

Python is a popular programming language. People use it to create software that analyzes quantitative data, runs websites, manages finances, and more. In Python, you can create **expressions**, **functions**, and **variables** that help you work with and understand the data you are using.

### Example of an Expression

In [21]:
20+20

40

### Example of a Function
Functions are series of expressions that take in inputs, perform some action with them, and return some output. The print() function is a very simple example. It takes in some number or text, and simply outputs its input.

In [22]:
print("Hello")

Hello
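You can also define your own functions. The short sketch below is only an illustration and is not used elsewhere in this notebook; the name `add_twenty` is made up for this example. It takes one input, performs an action with it, and returns an output:

```python
# A hypothetical function, defined only for illustration.
def add_twenty(number):
    """Return the input number plus 20."""
    return number + 20

# Call the function with the input 20 and store what it returns.
result = add_twenty(20)
print(result)  # prints 40
```

Notice that `add_twenty(20)` gives the same value as the expression `20+20` above. The function simply packages that computation so it can be reused with other inputs.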


### Example of a Variable
A **variable** is a name that stores a value so that you can reuse it later. An important type of value that a variable can hold is a **string**. Strings are sequences of characters, such as words or sentences. Strings are always surrounded by quotes.

In [23]:
# String
"Immunity"

'Immunity'
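To make the idea of a variable concrete, here is a minimal sketch. The variable names `topic`, `course`, and `label` are chosen just for this example. Each name is assigned a string with `=` and can then be reused:

```python
# Assign strings to variables (the names are illustrative only).
topic = "Immunity"
course = "MCB 50"

# Strings can be joined together with +.
label = course + ": " + topic
print(label)       # prints MCB 50: Immunity

# len() counts the characters in a string.
print(len(topic))  # prints 8
```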

### Python Power!

Python is used all the time in the field of immunity and disease! Check out this ['Towards Data Science'](https://towardsdatascience.com/modelling-the-coronavirus-epidemic-spreading-in-a-city-with-python-babd14d82fa2) article, which models and simulates the spread of COVID-19 in the city of Yerevan. Using Python in this manner is really helpful because it outlines potential outcomes of the pandemic without anyone actually having to witness them.

---

# 3. Intro to MRSA <a id='2'></a>


TO BE FILLED OUT BY PROFESSOR 

### Discussion Question 1: 
**From what you know about MRSA, where do you think the majority of MRSA cases are located in California (urban counties vs. rural counties)? Why might this difference in population density be significant in the total number of MRSA infections?**

REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 4. Intro to Data <a id='3'></a>

### 4.1 Reading in Data Sets

In [24]:
# This cell will read in the necessary data sets. Run it and take a look at the dataframes below!
mrsa_merged = Table.read_table('mrsa_merged.csv') #merged mrsa data
infec_pop_merge = Table.read_table('infec_pop_merge.csv') #combined mrsa and population data

print("Done!")

Done!


---

### 4.2 Understanding the Data

Let's see what the raw MRSA data from 2013 looks like. You can scroll horizontally to see all of the columns that are included in the table.

In [25]:
mrsa_2013_raw = Table.read_table('mrsa raw data/mrsa-in-hospitals-2013.csv') #raw data
mrsa_2013_raw.show(5)

Year,State,HAI,Facility_Name1,Facility_Name2,Facility_Name3,FACID1,FACID2,FACID3,County,Infection_Count,Patient_Days,SIR,SIR_95%_CI_Lower_Limit,SIR_95%_CI_Upper_Limit,Comparison,Notes
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Adventist Medical Center, Hanford",Adventist Medical Center-Selma,.,40000122,630012960,.,Kings-Fresno,2,42875,0.76,0.13,2.5,No Different,† See Data Dictionary
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Adventist Medical Center, Reedley",.,.,40000124,.,.,Fresno,0,5970,,,,.,
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"AHMC Anaheim Regional Medical Center, Anaheim",.,.,60000002,.,.,Orange,3,51929,1.15,0.29,3.13,No Different,
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Alameda County Medical Center, Oakland","Alameda County Medical Center-Fairmont Campus, San Leandro",.,140000034,140000184,.,Alameda,7,55590,2.09,0.91,4.12,No Different,† See Data Dictionary
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,Alameda Hospital,.,.,140000011,.,.,Alameda,0,11520,0.0,0.0,2.88,No Different,


Shown above are the first five rows of the data table. As you can see, there are a lot of columns, and some of them are missing a lot of information. We have cleaned the data for you by removing unnecessary features and renaming the columns to make their purpose clearer. See the cleaned data set from 2013 below. 

In [26]:
mrsa_2013 = Table.read_table('mrsa cleaned data/mrsa_2013.csv') #cleaned data
mrsa_2013.show(5)

Year,State,HAI,Facility1,Facility1_ID,County,Infection_Count,Num_patients
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Adventist Medical Center, Reedley",40000124,Fresno,0,5970
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"AHMC Anaheim Regional Medical Center, Anaheim",60000002,Orange,3,51929
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Alameda County Medical Center, Oakland",140000034,Alameda,7,55590
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,Alameda Hospital,140000011,Alameda,0,11520
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,Alhambra Hospital Medical Center,930000005,Los Angeles,0,19433


---

### 4.3 Breaking down the table

#### 4.3.1 Rows

Let's take a look at the first row of the mrsa 2013 data set. 

In [27]:
mrsa_2013.take(0)

Year,State,HAI,Facility1,Facility1_ID,County,Infection_Count,Num_patients
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Adventist Medical Center, Reedley",40000124,Fresno,0,5970


What does this particular row, or record, represent? See if you can figure out what kind of information is held in this record. 

Thought about it? If you concluded that this is a record of a report done in 2013 in California of the count of MRSA infections at the Adventist Medical Center, Reedley in Fresno County, you got it! 

Try analyzing another row: change the number '0' in the code cell above to see the row with that number. Row numbering starts at 0, so make sure your number is less than 352, which is the number of rows in the table.
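Row numbers in Python start at 0, which is why `take(0)` returned the first row. The sketch below illustrates the same counting rule with a plain Python list instead of the notebook's table. The list holds the five facility names displayed earlier, and the variable names are chosen just for this example:

```python
# The five facility names shown in the cleaned 2013 table above.
facilities = [
    "Adventist Medical Center, Reedley",
    "AHMC Anaheim Regional Medical Center, Anaheim",
    "Alameda County Medical Center, Oakland",
    "Alameda Hospital",
    "Alhambra Hospital Medical Center",
]

num_rows = len(facilities)       # 5 items in this small example
first = facilities[0]            # counting starts at 0
last = facilities[num_rows - 1]  # so the last valid index is num_rows - 1

print(first)  # prints Adventist Medical Center, Reedley
print(last)   # prints Alhambra Hospital Medical Center
```

By the same rule, a table with 352 rows has valid row numbers 0 through 351.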

#### 4.3.2. Columns
The first three rows of the cleaned data set are shown below so that you can see its columns. 

In [28]:
mrsa_2013.take(np.arange(0,3))

Year,State,HAI,Facility1,Facility1_ID,County,Infection_Count,Num_patients
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Adventist Medical Center, Reedley",40000124,Fresno,0,5970
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"AHMC Anaheim Regional Medical Center, Anaheim",60000002,Orange,3,51929
2013,California,Methicillin-Resistant Staphylococcus aureus Bloodstream ...,"Alameda County Medical Center, Oakland",140000034,Alameda,7,55590


Take a look at the columns in this snippet of the table. Most of the columns should be pretty self-explanatory. For those that aren't:
- HAI is Healthcare-Associated Infection (often called a hospital-acquired infection). So the data we are looking at refers to MRSA infections in hospital patients only. 
- Facility1 is the name of the specific medical center/hospital
- Facility1_ID is a unique number identifier for the medical facility
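As a rough illustration of how columns can be combined to answer a question, the sketch below adds up `Infection_Count` for each `County`. It uses only Python's built-in `csv` module and a small inline copy of four columns from the five rows displayed earlier, not the full data file, so the totals cover those five hospitals only. The names `sample_csv` and `county_totals` are made up for this example:

```python
import csv
import io

# Four columns from the five cleaned rows shown earlier in this notebook.
sample_csv = '''Facility1,County,Infection_Count,Num_patients
"Adventist Medical Center, Reedley",Fresno,0,5970
"AHMC Anaheim Regional Medical Center, Anaheim",Orange,3,51929
"Alameda County Medical Center, Oakland",Alameda,7,55590
Alameda Hospital,Alameda,0,11520
Alhambra Hospital Medical Center,Los Angeles,0,19433
'''

# Add up the infection counts of all hospitals in the same county.
county_totals = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    county = row["County"]
    county_totals[county] = county_totals.get(county, 0) + int(row["Infection_Count"])

print(county_totals)
# prints {'Fresno': 0, 'Orange': 3, 'Alameda': 7, 'Los Angeles': 0}
```

Alameda appears twice in the sample, so its two hospitals are combined into a single county total.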

### 4.4  The Data Source
Our primary data source is data.gov 

See this [link](https://data.chhs.ca.gov/dataset/methicillin-resistant-staphylococcus-aureus-mrsa-bloodstream-infections-bsi-in-california-hospitals) for more information on the data source.

### Discussion Question 2:
**Of the questions we asked about MRSA, which are actually answerable given this dataset? Which questions could be answerable with a different dataset? And which would be hard to answer with the data we gave you?**

REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 5. Comparing Infection Rates Over Time <a id='4'></a>
Goal: See if and how infection rates change over the time period for which we have data

Use the plot widget to answer the discussion question below. Run the code cell below and toggle through the drop-down menu to look at infection counts for different counties. 

In [29]:
from widgets import infection_rates_per_county

infection_rates_per_county()

SyntaxError: invalid syntax (widgets.py, line 3)

### Discussion Question 3:
**Choose one urban county and one rural county to evaluate. What trends can you identify? What outliers do you see? Make a general statement on how MRSA infections have changed over time in these two counties.**


REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 6. Comparing County Infection Rates with County Populations <a id='5'></a>


### 6.1 Infection rate by county per year

How do infection counts relate to total population over the years? The following code cell displays a widget that plots a regression line through these two variables. See if you can catch an interesting trend. 

In [30]:
from widgets import population_v_infection_by_county

population_v_infection_by_county()

SyntaxError: invalid syntax (widgets.py, line 3)
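For readers curious about what a regression line is, here is a minimal sketch of an ordinary least-squares fit, a common way to draw such a line. This is an assumption for illustration: the widget's actual method is not shown in this notebook, the function name `fit_line` is hypothetical, and the numbers below are made up rather than taken from the MRSA data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up example values, NOT real county data.
populations = [100_000, 200_000, 300_000, 400_000]
infections = [3, 5, 7, 9]

slope, intercept = fit_line(populations, infections)
print(f"Extra infections per 100,000 more people: {slope * 100_000:.2f}")  # 2.00
print(f"Intercept: {intercept:.2f}")  # 1.00
```

A positive slope means that larger populations tend to go with more infections; a slope near zero would mean population tells you little about infection counts.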

### Discussion Question 4: 

**Look at the same two urban and rural counties that you chose for the previous question. What is the above graph showing for these two counties?** 

REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

### 6.2 Infection rate across counties by year 

In [31]:
from widgets import population_vs_infection_by_year
population_vs_infection_by_year()

SyntaxError: invalid syntax (widgets.py, line 3)

### Discussion Question 5: 

**Look at the same urban and rural counties that you chose above. What is the above graph showing for these two counties?** 

REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

### 6.3 California Counties Colored by Infection Rate (number of infections/population)

Below is a choropleth map that shows each California county and its rate of infection per unit of population (100,000 people). The darker red the county, the higher its MRSA infection rate. The colorbar on the right side of the map shows the range of infection rates. Take a look at the map and use it to answer the discussion question below. 

![](ca_map.png)
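The rate shown on the map divides a county's infection count by its population and scales the result to 100,000 people. Here is a minimal sketch of that arithmetic; the function name `rate_per_100k` is hypothetical and the two counties use made-up numbers, not values from the data set:

```python
def rate_per_100k(infection_count, population):
    """Infections per 100,000 people."""
    return infection_count * 100_000 / population

# Made-up numbers for two imaginary counties.
large_county = rate_per_100k(12, 400_000)
small_county = rate_per_100k(2, 20_000)

print(large_county)  # prints 3.0
print(small_county)  # prints 10.0
```

In this made-up example the small county has fewer infections in total but a higher rate, which is why a map of rates can look very different from a map of raw counts.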

### Discussion Question 6:
**Having looked at the choropleth map of California counties above, what trends/outliers do you see? Find your two chosen urban and rural counties from previous questions. Are the MRSA infection rates (denoted by color) consistent with what you originally thought? What would you say overall about the relationship between infection rates and population?**




REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

### [OPTIONAL] 7. Bonus Challenge <a id='6'></a>

<div class="alert alert-block alert-warning">
    <b>BONUS CHALLENGE (for folks who have taken Data-8 or beyond)</b>
    <br />
    Our data sets also have variables we didn’t use, including population split into race and sex subcategories, and infection rates by hospital or predicted vs actual infection rates. Try to alter the given code to compare two new variables. What patterns do you see?

</div>



**Provided below is a table containing California Census population data. This data is grouped by Year and County. It includes the Total Populations, as well as populations by race and sex.**

In [44]:
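# group() returns a new table; show() only displays it and returns nothing,
# so store the grouped table in census_bonus first and display it separately.
census_bonus = Table.read_table('census_bonus.csv').group(["Year", "County"], sum)
census_bonus.show(5)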

Year,County,Total_Population sum,White Male sum,White Female sum,Black Male sum,Black Female sum,American Indian Male sum,American Indian Female sum,Asian Male sum,Asian Female sum,Native Hawaiian Male sum,Native Hawaiian Female sum,2+ Race Male sum,2+ Race Female sum,Not Hispanic Male sum,Not Hispanic Female sum,Hispanic Male sum,Hispanic Female sum
2013,Alameda County,3160494,821682,818902,184490,208928,18372,17306,429146,468232,14974,16070,79360,83032,1186842,1260984,361182,351486
2013,Alpine County,2254,846,772,8,2,270,270,10,14,0,0,42,20,1058,1002,118,76
2013,Amador County,73250,34960,31692,1344,176,896,660,462,500,92,78,1194,1196,33288,30762,5660,3540
2013,Butte County,443328,190318,195390,4334,3308,5408,5536,9832,9738,610,578,8946,9330,185206,191088,34242,32792
2013,Calaveras County,89320,41190,41248,542,288,824,768,542,726,96,86,1488,1522,39542,39924,5140,4714


WRITE YOUR OBSERVATIONS HERE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 8. Submit Your Work <a id='7'></a>