# Formulas, adding new fields and merging your Group Bys

You can do formulas in Python just as you would in Excel - they just look a little different!

We are going to use the vaccine data you worked with when you were studying for the Excel exam to learn how to do that. First, let's start by importing pandas AND importing a new toolbox: NumPy. NumPy gives us np.where(), which is how we'll write if/then statements.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('measlesvaccinedata.csv')

# Creating new columns
You will run into a lot of reasons for creating new columns. Here are just a few:

- Doing math on multiple columns. If you have the percent of kids vaccinated and the total number of kids, and you want to know the number of vaccinated kids, you need to multiply these two fields together. The same goes for calculating rates: you will frequently create columns that are the result of dividing one column by another to get a percent.

- Conditionals. These are your if/then statements. Say you want to have a column that says "yes" or "no" for herd immunity.

- Creating a marker. Say you have a bunch of CSVs that all contain different years, but the year the data represents is not actually a column in the data - it's just in the name of the file. (This happens a lot.) It is good data practice to make sure you've got important labels clearly attached to each row of your data, especially if you plan on making one giant CSV that contains multiple years. Another example: you have different datasets containing the same kind of information, such as salaries, but for a different city in each dataset. Eventually you'll want to put them all into the same file, so you need to create a field in each of your files that indicates which city each row is associated with.

Run df.head() with this vaccine data. Do we know what year we're looking at? No. Let's add a year to it.

In [None]:
df['YEAR'] = '2019-2020'

Here's the anatomy of this: 

First we must tell the computer in what dataset we are creating a new field. That's df here. 

Then we tell the computer what the name of the new column is. We have to put it in brackets and single or double quotes.

Then, we tell the computer what to put in the new column by saying that it equals the thing we want to put in the new column. In this case, we just want it to say '2019-2020' in every single row. 

Try df.head() and see how your new data looks. 
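If you want to see the idea in isolation, here is a minimal sketch on a tiny made-up table. The school names and enrollment numbers are invented for illustration and are not from the vaccine file; the only real point is that a single value on the right side of the equals sign gets copied into every row.

```python
import pandas as pd

# Toy stand-in for the vaccine data; names and numbers are made up.
toy = pd.DataFrame({
    'SCHOOL NAME': ['School A', 'School B', 'School C'],
    'ENROLLED': [100, 40, 60],
})

# One value on the right side is copied into every row of the new column.
toy['YEAR'] = '2019-2020'

print(toy)
```

Notice the year is in quotes, so pandas stores it as text rather than trying to do the subtraction 2019 minus 2020.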

And now, let's add a column to designate that we are working with Arizona data. 

What would that look like? Put it in the cell below. 

Now that we've learned how to make a new column that serves as a marker, let's move on to math.

Your operators will be:

- an asterisk * for multiplication
- a forward slash / for division
- a minus sign - for subtraction
- a plus sign + for addition

To create a column based on math between two or more existing columns, you need to name the dataframe each column comes from, followed by the name of the column in brackets and quotes. 

In [None]:
df['vaccinated'] = df['ENROLLED'] * df['% IMMUNE MMR']
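Here is a small sketch of what that math does, row by row, on made-up numbers. It assumes the '% IMMUNE MMR' column holds fractions like 0.95 rather than whole percents like 95 (check your own file with df.head(), because the multiplication only gives a head count if that is true). The column name 'pct_immune' is hypothetical and used only for this example.

```python
import pandas as pd

# Made-up numbers standing in for the real file.
# Assumption: '% IMMUNE MMR' holds fractions (0.95), not whole percents (95).
toy = pd.DataFrame({
    'ENROLLED': [100, 40, 60],
    '% IMMUNE MMR': [0.95, 0.5, 0.9],
})

# Column times column: pandas multiplies the two values in each row.
toy['vaccinated'] = toy['ENROLLED'] * toy['% IMMUNE MMR']

# Column times a single number: every row is multiplied by 100.
# 'pct_immune' is a hypothetical column name for this example only.
toy['pct_immune'] = toy['% IMMUNE MMR'] * 100

print(toy)
```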

Now create a column that represents the number of unvaccinated kids in the cell below. 

# Rounding
You can use .round() followed by .astype(int) to turn your results into whole numbers, so you don't end up with decimal points (in this case, fractions of kids).

That would look like this:

df['column_name'] = df['column_name'].round()
df['column_name'] = df['column_name'].astype(int)
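A quick sketch of rounding on made-up numbers, plus one thing to watch for: .astype(int) raises an error if the column has any blank (missing) values. One possible workaround, shown here as an option rather than the required approach, is pandas' nullable integer type 'Int64' (capital I), which keeps blanks as blanks. The name 'gappy' is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up head counts with decimals.
toy = pd.DataFrame({'vaccinated': [94.6, 20.2, 53.7]})

# The two steps can be chained on one line.
toy['vaccinated'] = toy['vaccinated'].round().astype(int)
print(toy)

# A column with a blank in it ('gappy' is a made-up example).
gappy = pd.Series([94.6, np.nan])

# Plain int cannot hold a missing value, so this raises an error.
try:
    gappy.round().astype(int)
    failed = False
except ValueError:
    failed = True
print('astype(int) failed on the blank:', failed)

# One option: the nullable integer type keeps the blank as <NA>.
gappy_int = gappy.round().astype('Int64')
print(gappy_int)
```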

# Where statements, aka If/Then statements
You will run into all kinds of situations where you need a conditional formula. For example, we know that the herd immunity threshold for measles is 95 percent, and we want to make a field that just says 'yes' or 'no' so that we can do further Group Bys to answer questions like: What percentage of charter schools are below herd immunity? What percentage of schools per county are below herd immunity?

Below is what that will look like. 

First, we start by creating a new column. We tell the computer what dataframe we're creating the new column in, and what we are calling it.

Then, we invoke "numpy." We called it np when we imported it. 

np.where() sets up the command. Think of it as saying "if."

Next, we tell the computer the dataframe we're working with, followed by the column we want to do the formula to. Again, that's in brackets and double or single quotes. 

Now we tell the computer what comparison we want to make. In this case, we want greater than or equal to .95, so we're using >=. Then follow with the text you want if the condition is met, and after that the text you want if it is not. This formula will return 'yes' for rows that are greater than or equal to .95 and 'no' for rows that are not.

In [None]:
df['herdimmunity'] = np.where(df['% IMMUNE MMR'] >= .95, 'yes', 'no') 
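Here is the same formula on four made-up rates so you can see each case. One thing worth checking in real data: a blank rate is not greater than or equal to anything, so it lands in the 'no' pile even though we don't actually know that school's rate.

```python
import numpy as np
import pandas as pd

# Made-up rates: above the line, exactly on the line, below it, and blank.
toy = pd.DataFrame({'% IMMUNE MMR': [0.97, 0.95, 0.90, np.nan]})

# np.where(condition, value if true, value if false)
toy['herdimmunity'] = np.where(toy['% IMMUNE MMR'] >= .95, 'yes', 'no')

print(toy)
```

If your file has blanks in this column, you may want to look at those schools separately before counting them as below herd immunity.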

You've learned group bys. So in the cell below, write the code that would allow you to count how many schools are in each category, above or below herd immunity. 

# Merging GroupBys 
You've learned how to do GroupBys and how to merge datasets. Let's put that into practice by merging our GroupBys to get percentages. 

The question: What percent of schools in each county are below herd immunity? 

The answer is a four-step process. 

1. First, make a dataframe that is a groupby giving us the total number of schools per county. Try that below. 

2. Now we need to pull out the schools that are below herd immunity. We will do this by creating a new dataframe that excludes all schools that are at or above herd immunity. The code below creates a new dataframe called 'belowherd' by selecting only the schools where herdimmunity is 'no'. 

In [None]:
belowherd = df[df['herdimmunity'] == 'no']
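To see why that line has df in it twice, here is a sketch on made-up rows that splits it into two steps. The inner part builds a True/False answer for every row, and the outer df[...] keeps only the rows marked True. The name 'is_below' and the school names are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    'SCHOOL NAME': ['School A', 'School B', 'School C'],
    'herdimmunity': ['yes', 'no', 'no'],
})

# Step 1: ask the question of every row. The result is True or False per row.
is_below = toy['herdimmunity'] == 'no'
print(is_below)

# Step 2: keep only the rows where the answer was True.
belowherd = toy[is_below]
print(belowherd)
```

The original dataframe is left untouched, so you can still use it for the totals.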

3. Now we need to do a group by that counts the number of schools that are below herd immunity. Put that in the cell below. 

4. Now we need to calculate percentages by county. First we must join our two GroupBys on county name so that we can do math between the belowherd column and the total schools column. That will look like this:

In [None]:
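# Both dataframes use the same key name, so a single on= is enough.
# how="left" keeps every county from counties, even ones with no match.
ratesbycounty = counties.merge(countiesbelowherd, on='COUNTY', how="left")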
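Here is a sketch of what that left join produces, using two small hand-typed tables in place of your GroupBys. The county labels, the counts, and the column names 'total_schools' and 'below_herd' are all made up for illustration; your own GroupBys may name their columns differently. The thing to notice is that a county with no schools below herd immunity does not appear in the second table at all, so after the merge it shows up as a blank rather than a zero.

```python
import pandas as pd

# Hand-typed stand-ins for the two GroupBys (all values are made up).
counties = pd.DataFrame({
    'COUNTY': ['County A', 'County B', 'County C'],
    'total_schools': [4, 2, 1],
})
countiesbelowherd = pd.DataFrame({
    'COUNTY': ['County A', 'County B'],  # County C has none below
    'below_herd': [1, 2],
})

# Left join: keep every row of counties, match on COUNTY where possible.
ratesbycounty = counties.merge(countiesbelowherd, left_on='COUNTY',
                               right_on='COUNTY', how='left')

# County C had no match, so its below_herd value is blank (NaN).
missing_before = ratesbycounty['below_herd'].isna().sum()
print('blank rows after merge:', missing_before)

# Here a blank really means zero schools, so fill it in before doing math.
ratesbycounty['below_herd'] = ratesbycounty['below_herd'].fillna(0).astype(int)

print(ratesbycounty)
```

Filling the blanks with 0 before you divide means those counties come out as 0 percent instead of blank.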

Now, based on what you've learned about formulas, add a column to your new dataframe (I've called it ratesbycounty) that calculates the percentage. You can send your findings to a csv with the code below:

In [None]:
ratesbycounty.to_csv('ratesbycounty.csv')

# Questions to answer for the in-class lab
What percent of schools are below herd immunity in each city? 
What percent of schools are below herd immunity in each school type? 
What percent of CHILDREN are unvaccinated in each county?
What percent of CHILDREN are unvaccinated in each city? 