# MCB 50: Immunity & Disease

**Estimated Time**: 30-40 minutes <br>
**Databook created by**: Rucha Kelkar, Harry Li, Elias Saravia

Today we will be examining a dataset (i.e. a table) on MRSA bloodstream infections in California hospitals.

### Table of Contents
1. [Intro to Jupyter](#0) <br>
1. [Intro to Python](#1) <br>
1. [Intro to MRSA](#2) <br>
1. [Intro to dataset](#3) <br>
1. [Question 1](#4) <br>
1. [Question 2](#5) <br>

# 1. Intro to Jupyter <a id='0'></a>

This webpage is a Jupyter Notebook. Jupyter Notebooks run Python code and provide an interactive interface for students. We will use this notebook to analyze a dataset on MRSA bloodstream infections in California hospitals. Jupyter Notebooks are composed of both regular text and code cells. Code cells have a gray background. To run a code cell, click the cell to select it and press Shift + Enter, or hit the ▶| Run button in the toolbar at the top. You can also save your work using the button in the top left-hand corner.

An example of a code cell is shown below. The contents of this cell set up the notebook by importing pre-written Python packages that we will be using to read in, clean, analyze, and model our data. Run it, and once it is done, you should see "Done!" printed underneath the cell. 

In [1]:
## DO NOT DELETE ANYTHING IN THIS CELL ##

# import utilities
import numpy as np
import pandas as pd

# plotting packages
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

# to work the widgets
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

print("Done!")

Done!


### 1.1 Types of Cells

#### Text Cells
Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called Markdown to add formatting and section headings. You don't need to learn Markdown, but know the difference between Text Cells and Code Cells.

#### Code Cells 
Other cells contain code in the Python 3 language. Don't worry -- we'll show you everything you need to know to succeed in this part of the class.

The fundamental building block of Python code is an **expression**. Cells can contain multiple lines with multiple expressions. We'll explain what exactly we mean by "expressions" in just a moment: first, let's learn how to "run" cells.
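As a small preview, here is a minimal sketch of a few Python expressions. The variable names (`total`, `ratio`, `label`) are made up for illustration and are not used elsewhere in this notebook.

```python
# The right-hand side of each line is an expression: Python evaluates it to a value
# and stores that value under the name on the left.
total = 3 + 4 * 2          # arithmetic follows the usual order of operations
ratio = 7 / 2              # division with / gives a decimal number
label = "MRSA" + " data"   # text (strings) can be joined with +

print(total)
print(ratio)
print(label)
```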

#### Running Cells
"Running a cell" is equivalent to pressing "Enter" on a calculator once you've typed in the expression you want to evaluate: it produces an output. When you run a text cell, it outputs clean, organized writing. When you run a code cell, it computes all of the expressions you want to evaluate, and can output the result of the computation.

<p></p>
<div class="alert alert-info">
To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, you can either press the <code><b> ▶|</b> Run </code> button above or press <b><code>Shift + Return</code></b> or <b><code>Shift + Enter</code></b>. This will run the current cell and select the next one.
</div>


*For this class, you are not expected to write any code. You will be looking at plots generated by pre-written widgets. Your job is to write up your discussions and observations in the provided text cells.*

#### How to save your work
Click on the leftmost icon in the toolbar (left of the plus icon).
Alternatively, you can hit Ctrl+S on a PC or Command+S on a Mac!

### 1.2 Common errors and how to fix them
Do we want errors specific to the errors they would run into on this notebook? or general errors you might get in jupyter?

examples of errors, add more cells if necessary

### 1.3 Where to get help - Peer consultants

This class, as well as the Division of Computing, Data Science, and Society, has many resources to help you get the most out of this assignment. Primarily, you can ask the peer consultants walking around this lab section for help if you run into any issues. 

Hi from your peer consultants!

---

# 2. Intro to Python <a id='1'></a>

TODO: TEXT DESCRIPTION STUFF ABOUT PYTHON 

TODO: SOME EASY CODE EXAMPLES IN PYTHON?

@Professor: What kinds of python examples are most useful?

---

# 3. Intro to MRSA <a id='2'></a>


### 3.1 What is MRSA?

Methicillin-resistant *Staphylococcus aureus* (MRSA) is a strain of staph bacteria that is resistant to several common antibiotics. It is found in both community and hospital settings. In the community, MRSA most often causes skin infections, such as boils or rashes, and in some cases pneumonia. In a hospital setting, it can cause severe problems, including bloodstream infections, pneumonia, and surgical site infections. 

##To be filled in by prof

### 3.2 How can it be treated?

MRSA bacteria are spread by direct contact with people who are infected or carrying them. Although many people may carry MRSA, it is not a problem until it infects the body, often through a cut or open wound. Thus, the best way to prevent infection is to maintain good hand and body hygiene, keep cuts and scrapes clean and covered, and avoid sharing personal items such as dirty clothes, towels, or razors. 

### 3.3 More Information

To read more about MRSA, see this [link](https://www.cdc.gov/mrsa/community/index.html)<br>
To read more about the California data sets on MRSA infections that we will be working with, see this [link](https://data.chhs.ca.gov/dataset/methicillin-resistant-staphylococcus-aureus-mrsa-bloodstream-infections-bsi-in-california-hospitals)

### Discussion Question 1: 
**As scientists studying infectious diseases, what kinds of questions should we be asking about MRSA?**

REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 4. Intro to Data <a id='3'></a>

### 4.1

In [2]:
# This cell will read in the necessary data sets. Run it and take a look at the dataframes below!

mrsa_merged = pd.read_csv('mrsa_merged.csv') #merged mrsa data
infec_pop_merge = pd.read_csv('infec_pop_merge.csv') #combined mrsa and population data

print("Done!")

Done!


---

Let's see what the raw MRSA data from 2013 looks like. You can scroll horizontally to see all of the columns that are included in the table.

In [3]:
mrsa_2013_raw = pd.read_csv('mrsa raw data/mrsa-in-hospitals-2013.csv') #raw data
mrsa_2013_raw.head()

Unnamed: 0,Year,State,HAI,Facility_Name1,Facility_Name2,Facility_Name3,FACID1,FACID2,FACID3,County,Infection_Count,Patient_Days,SIR,SIR_95%_CI_Lower_Limit,SIR_95%_CI_Upper_Limit,Comparison,Notes
0,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"Adventist Medical Center, Hanford",Adventist Medical Center-Selma,.,40000122,630012960,.,Kings-Fresno,2.0,42875.0,0.76,0.13,2.5,No Different,† See Data Dictionary
1,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"Adventist Medical Center, Reedley",.,.,40000124,.,.,Fresno,0.0,5970.0,,,,.,
2,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"AHMC Anaheim Regional Medical Center, Anaheim",.,.,60000002,.,.,Orange,3.0,51929.0,1.15,0.29,3.13,No Different,
3,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"Alameda County Medical Center, Oakland","Alameda County Medical Center-Fairmont Campus,...",.,140000034,140000184,.,Alameda,7.0,55590.0,2.09,0.91,4.12,No Different,† See Data Dictionary
4,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,Alameda Hospital,.,.,140000011,.,.,Alameda,0.0,11520.0,0.0,0.0,2.88,No Different,


Shown above are the first five rows of the data table. As you can see, there are a lot of columns, and some columns have a lot of missing information. We have cleaned the data for you by removing unnecessary features and renaming the columns to make their purpose clearer. See the cleaned data set from 2013 below. 

In [4]:
mrsa_2013 = pd.read_csv('mrsa cleaned data/mrsa_2013.csv') #cleaned data
mrsa_2013.head()

Unnamed: 0,Year,State,HAI,Facility1,Facility1_ID,County,Infection_Count,Num_patients
0,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"Adventist Medical Center, Reedley",40000124,Fresno,0.0,5970.0
1,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"AHMC Anaheim Regional Medical Center, Anaheim",60000002,Orange,3.0,51929.0
2,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"Alameda County Medical Center, Oakland",140000034,Alameda,7.0,55590.0
3,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,Alameda Hospital,140000011,Alameda,0.0,11520.0
4,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,Alhambra Hospital Medical Center,930000005,Los Angeles,0.0,19433.0
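The cleaning code itself is not part of this notebook. As a rough sketch of what "removing features and renaming columns" can look like in pandas, the example below rebuilds two rows of the raw table by hand (values copied from the raw output above, with only a few columns kept) and then drops and renames columns. The choice of which columns to drop is an assumption for illustration, not the authors' actual procedure.

```python
import pandas as pd

# Two rows copied by hand from the raw 2013 output above (subset of columns).
raw = pd.DataFrame({
    "Facility_Name1": ["AHMC Anaheim Regional Medical Center, Anaheim", "Alameda Hospital"],
    "FACID1": [60000002, 140000011],
    "County": ["Orange", "Alameda"],
    "Infection_Count": [3.0, 0.0],
    "Patient_Days": [51929.0, 11520.0],
    "SIR": [1.15, 0.0],
    "Comparison": ["No Different", "No Different"],
})

# Assumed cleaning steps: drop columns we do not need, then rename the rest
# to match the column names shown in the cleaned table.
cleaned = (
    raw.drop(columns=["SIR", "Comparison"])
       .rename(columns={
           "Facility_Name1": "Facility1",
           "FACID1": "Facility1_ID",
           "Patient_Days": "Num_patients",
       })
)

print(cleaned.columns.tolist())
```

Both `drop` and `rename` return new tables, so `raw` itself is left unchanged.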


---

### 4.2 Breaking down the table

#### 4.2.1 Rows

Let's take a look at the first row of the mrsa 2013 data set. 

In [5]:
mrsa_2013.take([0])

Unnamed: 0,Year,State,HAI,Facility1,Facility1_ID,County,Infection_Count,Num_patients
0,2013,California,Methicillin-Resistant Staphylococcus aureus Bl...,"Adventist Medical Center, Reedley",40000124,Fresno,0.0,5970.0


What does this particular row, or record, represent? See if you can figure out what kind of information is held in this record. 

Thought about it? If you concluded that this record reports the number of MRSA bloodstream infections in 2013 at the Adventist Medical Center, Reedley, in Fresno County, California, you got it! 

Try analyzing another row: change the number '0' in the code cell above to see the row with that number (make sure your number is less than 352, the number of rows in the table). 
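To see how row selection by position works without touching the real data, here is a minimal sketch on a hypothetical three-row table named `toy` (a stand-in for `mrsa_2013`, not part of the notebook's data).

```python
import pandas as pd

# Hypothetical three-row table standing in for mrsa_2013.
toy = pd.DataFrame({
    "County": ["Fresno", "Orange", "Alameda"],
    "Infection_Count": [0.0, 3.0, 7.0],
})

first_row = toy.take([0])   # same kind of call as in the cell above
third_row = toy.iloc[[2]]   # iloc also selects rows by position
n_rows = len(toy)           # valid positions run from 0 to n_rows - 1

print(first_row)
print(third_row)
print(n_rows)
```

Positions start at 0, which is why the largest valid number is one less than the number of rows.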

#### 4.2.2. Columns
See below a list of the columns in the cleaned data set. 

In [6]:
mrsa_2013.columns.tolist()

['Year',
 'State',
 'HAI',
 'Facility1',
 'Facility1_ID',
 'County',
 'Infection_Count',
 'Num_patients']

Self-explanatory?

### 4.3  The Data Source
Data source description (coming soon...)

See this [link](https://data.chhs.ca.gov/dataset/methicillin-resistant-staphylococcus-aureus-mrsa-bloodstream-infections-bsi-in-california-hospitals) for more information on the data source.

### Discussion Question 2:
**Of the questions we asked about MRSA, which are actually answerable given this dataset? Which questions could be answerable with a different dataset? And which would be hard to answer with data (and why)?**

REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 5. Comparing Infection Rates Over Time
Goal: See if and how infection rates change over the time period for which we have data

Use the plot widget to answer the discussion question below. Run the code cell below and toggle through the drop-down menu to look at infection counts for different counties. 

In [7]:

from widgets import infection_rates_per_county

infection_rates_per_county()



interactive(children=(Dropdown(description='County', options=('Fresno', 'Orange', 'Alameda', 'Los Angeles', 'S…

In [8]:
# line graph widget
def line_county(county):
    # total infection count per year across all hospitals in the chosen county
    yearly = (mrsa_merged.loc[mrsa_merged['County'] == county]
              .groupby('Year')['Infection_Count'].sum())
    plt.figure(figsize=(10,5))
    # newer seaborn versions require x and y to be passed by keyword
    sns.lineplot(x=yearly.index, y=yearly.values)
    title = 'Infection Count in '+county+' County Per Year'
    plt.title(title)
    plt.xlabel("Year")
    plt.ylabel("Infection Count")
    return 

wid_1 = widgets.Dropdown(
        options = mrsa_merged['County'].unique().tolist(),
        description = 'County',
        disabled = False
)

interact(line_county, county = wid_1);

interactive(children=(Dropdown(description='County', options=('Fresno', 'Orange', 'Alameda', 'Los Angeles', 'S…
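The line plot above depends on one key step: keeping the rows for a single county and adding up the infection counts from all of its hospitals for each year. A minimal sketch of that step, using made-up numbers rather than values from `mrsa_merged`:

```python
import pandas as pd

# Made-up counts for illustration only; not taken from mrsa_merged.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "B"],
    "Year": [2013, 2013, 2014, 2013],
    "Infection_Count": [2.0, 3.0, 4.0, 10.0],
})

# Keep one county, then sum the counts from all of its hospitals per year.
yearly = toy.loc[toy["County"] == "A"].groupby("Year")["Infection_Count"].sum()

print(yearly)
```

The result has one value per year, which is what gets drawn as one point on the line.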

### Discussion Question 3:
**What trends do you identify? What outliers do you see? How would you answer our question: do infection rates change over the time period for which we have data, and if so, how?**


REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

---

# 6. Comparing County Infection Rates with County Populations


### 6.1 Population versus infection rate for each county

What is the trend over the years of total population against infection counts? The following code cell displays a widget that plots a regression line over these two variables. See if you can catch an interesting trend. 

In [9]:
# yuba county has data from 3 years only
# no counties with min rate>=1.5 & max rate<=3.5 so the bins below work

In [10]:
# .copy() gives an independent table, so adding a column does not act on a slice
df = infec_pop_merge.loc[infec_pop_merge['County'] == 'Humboldt'].copy()

df['year_norm'] = range(len(df))  # years counted from 0

#p = sns.lmplot(x='year_norm',y='Infec_Div_Pop',data=df,ci=None,height=6,aspect=2)

np.mean(df['Infec_Div_Pop'])/np.mean(df['year_norm'])

# how to find the slope
'''
1. find slopes of 2013-14, 2014-15, ..., 2017-18
2. average said slopes
'''


'\n1. find slopes of 2013-14, 2014-15, ..., 2017-18\n2. average said slopes\n'
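The two steps listed in the cell above (find the slope between each pair of consecutive years, then average those slopes) can be sketched as follows. The yearly rates here are made up for illustration and are not values from `infec_pop_merge`.

```python
import numpy as np

# Made-up yearly rates for illustration only.
years = np.array([2013, 2014, 2015, 2016])
rates = np.array([1.0, 1.5, 1.25, 2.5])

# Step 1: slope between each pair of consecutive years (rise over run).
yearly_slopes = np.diff(rates) / np.diff(years)

# Step 2: average those slopes.
average_slope = yearly_slopes.mean()

print(yearly_slopes)
print(average_slope)
```

One thing to keep in mind with this approach: when the years are evenly spaced, the year-to-year changes cancel out in the average, so the result depends only on the first and last values. A least-squares regression line uses all of the points instead.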

In [11]:
from widgets import population_v_infection_by_county

population_v_infection_by_county()

interactive(children=(Dropdown(description='County', options=('Alameda', 'Amador', 'Butte', 'Calaveras', 'Colu…

In [12]:
# scatter plot widget - infection rate per 100,000 people by county each year 
def pop_v_infec_by_county(county):    
    df = infec_pop_merge.loc[infec_pop_merge['County'] == county]  
    p = sns.lmplot(x='Year',y='Infec_Div_Pop',data=df,ci=None,height=6,aspect=2)
    title = 'Infection Count Per 100,000 People in '+county+' County'
    plt.title(title)
    plt.xlabel("Year")
    plt.ylabel("Infection Rate")
    plt.setp(p.ax.lines,linewidth=2)
    
    ylims = (-.1,2)
    if (df['Infec_Div_Pop'].min()>=2) and (df['Infec_Div_Pop'].max()<=4):
        ylims = (1.9,4)
    
    plt.ylim(ylims[0],ylims[1])
    
   # print('Correlation: ',df.corr()['Total_Population']['Infection_Count'])
    return 

wid_2 = widgets.Dropdown(
        options = infec_pop_merge['County'].unique().tolist(),
        description = 'County',
        disabled = False
)

interact(pop_v_infec_by_county, county = wid_2);

interactive(children=(Dropdown(description='County', options=('Alameda', 'Amador', 'Butte', 'Calaveras', 'Colu…

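What is the above graph showing?
- It shows the number of MRSA cases reported (infection count) per 100,000 people in the total population of the county, with one point per year. 
- The fitted line summarizes the trend. Its slope is the average change in the infection rate per year. The correlation coefficient is a different quantity: it ranges from -1 to 1 and measures how closely the points follow a straight line. For example, for Alameda County corr = .5118, which indicates a moderate positive association between year and infection rate; it does not mean that the rate increases by .5118 per 100,000 people each year. 
- Open questions: the trend is hard to read because the points jump around and clearly do not increase at a steady rate each year. Is 100,000 people the right population unit (as opposed to 100, 1,000, or 1,000,000 people)?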
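Because slope and correlation are easy to mix up, here is a minimal sketch that computes both for the same made-up points. The data are invented for illustration; `np.polyfit` and `np.corrcoef` are standard NumPy functions.

```python
import numpy as np

# Made-up data for illustration only: points that lie exactly on a line.
years = np.array([0, 1, 2, 3, 4, 5])
rates = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

# Least-squares line: slope is the change in rate per year.
slope, intercept = np.polyfit(years, rates, 1)

# Pearson correlation: how closely the points follow a line (between -1 and 1).
corr = np.corrcoef(years, rates)[0, 1]

print(slope)
print(corr)
```

Here the rate rises by 2 per year while the correlation is 1, so the two numbers answer different questions even for the same data.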

In [13]:
from widgets import population_vs_infection_by_year
population_vs_infection_by_year()

interactive(children=(Dropdown(description='Year', options=(2013, 2014, 2015, 2016, 2017, 2018), value=2013), …

In [77]:
# scatter plot widget - infection count vs. county population (in units of 100,000 people) for one year
def pop_v_infec_by_year(year):    
    
    df = infec_pop_merge.loc[infec_pop_merge['Year'] == year].copy()
    df = df.drop(df['Total_Population'].idxmax())  # leave out the most populous county
    df['pop_by_100k'] = df['Total_Population']/100000
    
    p = sns.lmplot(x='pop_by_100k',y='Infection_Count',data=df,ci=None,height=6,aspect=2)
    title = 'Infection Count Across Counties in Year '+ str(year)
    plt.title(title)
    plt.xlabel("Total Population (Units of 100,000 People)")
    plt.ylabel("Infection Count")
    plt.setp(p.ax.lines,linewidth=2)
       
    plt.ylim(-5,83)
    
    # slope of the least-squares line (a correlation coefficient is not a slope)
    pts = df[['pop_by_100k','Infection_Count']].dropna()
    slope = np.polyfit(pts['pop_by_100k'], pts['Infection_Count'], 1)[0]
    print('Slope of Regression Line: ', slope)
    return 

wid_year = widgets.Dropdown(
        options = infec_pop_merge['Year'].unique().tolist(),
        description = 'Year',
        disabled = False
)

interact(pop_v_infec_by_year, year = wid_year);

interactive(children=(Dropdown(description='Year', options=(2013, 2014, 2015, 2016, 2017, 2018), value=2013), …
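The plots in this section express population in units of 100,000 people. As a worked example of that unit conversion, the sketch below uses made-up numbers; it also assumes that a rate "per 100,000 people" means the infection count divided by the population and multiplied by 100,000, which is the usual definition but is not shown explicitly in this notebook.

```python
# Made-up numbers for illustration only.
infection_count = 12
total_population = 800_000

# Population expressed in units of 100,000 people (the x-axis unit above).
pop_by_100k = total_population / 100_000

# Infections per 100,000 people (assumed definition of the rate).
rate_per_100k = infection_count / total_population * 100_000

print(pop_by_100k)
print(rate_per_100k)
```

Scaling by population in this way makes counties of very different sizes easier to compare.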

### [OPTIONAL] 6.2 Choropleth map

In [None]:
# choropleth map widget 

### Discussion Question 4:
**What trends/outliers do you see? What would you say overall about the relationship between infection rates and population?**



REPLACE THIS TEXT WITH YOUR RESPONSE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]

<div class="alert alert-block alert-warning">
    <b>BONUS CHALLENGE (for folks who have taken Data-8 or beyond)</b>
    <br />
    Our data sets also have variables we didn’t use, including population split into race and sex subcategories, and infection rates by hospital or predicted vs actual infection rates. Try to alter the given code to compare two new variables. What patterns do you see?

</div>



WRITE YOUR OBSERVATIONS HERE
<br> [Note: double click on the cell, type in your response, and run the cell to save your work]