In [None]:
import pandas as pd
# Then we'll load in our CSV file
df = pd.read_csv('datasets/Admission_Predict.csv', index_col=0)
# And we'll clean up a couple of poorly named columns like we did in a previous lecture
df.columns = [x.lower().strip() for x in df.columns]
# And we'll take a look at the results
df.head()

* querying dataframes is all about boolean masking

In [None]:
admit_mask=df['chance of admit'] > 0.7
admit_mask

* we can apply a mask in a couple of ways

In [None]:
df[admit_mask].head()

* of course, you don't have to make the mask object (and likely won't)

In [None]:
df[df['chance of admit'] > 0.7].head()
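
* to see what a mask actually is, here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are invented for illustration, they are not taken from `Admission_Predict.csv`)

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"gre score": [337, 316, 322], "chance of admit": [0.92, 0.65, 0.80]},
    index=[1, 2, 3],
)

# Comparing a column to a scalar gives a boolean Series with the same index
mask = toy["chance of admit"] > 0.7
mask_values = mask.tolist()

# Indexing with the mask keeps only the rows where it is True
kept = toy[mask]
kept_index = kept.index.tolist()
```

* because the mask carries the same index as the dataframe, pandas can line the two up row by row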


* we can also use the `where()` method; the subtle difference is that rows failing the mask are not dropped but kept and filled with NaN.

In [None]:
df.where(admit_mask).head()

* the nice thing about `where()` is that it's easy to read
* often you mix it together with `dropna()`

In [None]:
df.where(admit_mask).dropna().head()
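
* a small sketch of how `where()` plus `dropna()` compares with plain boolean indexing, again on an invented `toy` frame rather than the admissions data

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"gre score": [337, 316, 322], "chance of admit": [0.92, 0.65, 0.80]},
    index=[1, 2, 3],
)
mask = toy["chance of admit"] > 0.7

# where() keeps the original shape; rows failing the mask become all-NaN
masked = toy.where(mask)
n_rows_where = len(masked)
row2_all_nan = bool(masked.loc[2].isna().all())

# dropna() then removes those rows, leaving the same index as toy[mask]
cleaned = masked.dropna()
same_rows = cleaned.index.equals(toy[mask].index)

# One side effect: introducing NaN turns the integer column into floats
dtype_direct = str(toy[mask]["gre score"].dtype)
dtype_where = str(cleaned["gre score"].dtype)
```

* note that `dropna()` also removes rows that had missing values to begin with, so the two approaches only agree when the data has no other NaNs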

* masks can be composites, and made up of several conditions

In [None]:
# This raises a ValueError: the truth value of a Series is ambiguous
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)

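* the problem is that Python's `and` needs a single True/False from each operand, and pandas refuses to reduce a whole `Series` to one value, so this raises a `ValueError`.
* PEP 335 proposed letting objects overload `and`/`or`, but it was rejected: https://www.python.org/dev/peps/pep-0335/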

* But it does know how to `&` them!

In [None]:
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)
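
* the difference between `and` and `&` can be reproduced on a tiny `Series` (the values in `s` are made up for illustration)

```python
import pandas as pd

# Made-up values, for illustration only
s = pd.Series([0.92, 0.65, 0.80])

# `and` asks Python for a single True/False from each Series, which pandas refuses
try:
    (s > 0.7) and (s < 0.9)
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__

# `&` is applied element by element
both = ((s > 0.7) & (s < 0.9)).tolist()
```

* the same element-wise idea applies to `|` (or) and `~` (not)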

* But, you need to watch out for order of operations!

In [None]:
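# This fails: `&` binds tighter than `>` and `<`, so Python tries
# `0.7 & df['chance of admit']` before doing either comparison
df[df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9]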
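
* a minimal sketch of the precedence problem and its fix, using a made-up `Series` named `s`; on float data like this the unparenthesized form raises an error (a `TypeError` in the pandas versions I have seen, so the sketch catches that and `ValueError` to be safe)

```python
import pandas as pd

# Made-up values, for illustration only
s = pd.Series([0.92, 0.65, 0.80])

# Without parentheses Python evaluates `0.7 & s` first, which fails on floats
try:
    s > 0.7 & s < 0.9
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True

# With parentheses each comparison is evaluated before the `&`
result = s[(s > 0.7) & (s < 0.9)].tolist()
```

* the rule of thumb: wrap every comparison in its own parentheses before combining with `&` or `|`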

* finally, there are additional helper functions on dataframes to be aware of

In [None]:
df.where(df['chance of admit'].gt(0.7)).dropna().head()
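
* the comparison helpers can be combined the same way as the operators; a short sketch on a made-up `Series` named `s`

```python
import pandas as pd

# Made-up values, for illustration only
s = pd.Series([0.92, 0.65, 0.80])

# gt() and lt() are method forms of > and <, so no extra parentheses are needed
with_methods = s.gt(0.7) & s.lt(0.9)
with_operators = (s > 0.7) & (s < 0.9)
same = with_methods.equals(with_operators)
```

* each helper still has to be called on the original `Series`; chaining `s.gt(0.7).lt(0.9)` would compare the True/False results to 0.9 instead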

# Indexing
* Let's go back to indexing dataframes, there's some neat stuff there
* Remember that the index holds the row-level labels, and the column names are the column-level labels
* We can swap columns and rows trivially

In [None]:
df.T

* we saw that we can set the index with `set_index()`

In [None]:
df.set_index('lor').head()

In [None]:
# of course, this didn't actually change our previous dataframe, right?
df.head()
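
* a small sketch of what `set_index()` does and does not change; the `toy` frame, its values, and the index name `serial no.` are invented for illustration

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"lor": [4.5, 3.0], "cgpa": [9.65, 8.0]}, index=[1, 2])
toy.index.name = "serial no."

# set_index returns a new DataFrame; `toy` itself is untouched
by_lor = toy.set_index("lor")
original_index_name = toy.index.name
new_index_name = by_lor.index.name

# The old index is discarded unless it is first moved into a column
kept = toy.reset_index().set_index("lor")
kept_columns = kept.columns.tolist()
```

* to keep the change you have to assign the result back, e.g. `df = df.set_index(...)`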

# Multilevel indexing
* we can have hierarchical indices, which is pretty cool
* let's look at some census data

In [None]:
import pandas as pd 
df=pd.read_csv("datasets/census.csv")
df.head()

In [None]:
# in this data there are only two summary levels (SUMLEV):
# 40 marks state totals and 50 marks individual counties
print(len(df['SUMLEV'].unique()))
# so let's keep only the county-level rows and drop the state totals
df=df[df['SUMLEV'] == 50]
df.head()

In [None]:
# we can set a multilevel index just by passing a list of things we want to index on
df=df.set_index(['STNAME', 'CTYNAME'])
df.head()

* Querying gets, frankly, complex
* Someone remind me, how does df.loc work again?
  * df.loc[X,Y]

* `df.loc[row, column]`
* But with a multiindex we can do
  * `df.loc[row index1, row index2]`

In [None]:
df.loc['Michigan', 'Washtenaw County']['REGION']

In [None]:
df.loc['Michigan', 'REGION']['Washtenaw County']

In [None]:
# It's a bit ambiguous, so I recommend passing the row keys as a tuple instead
df.loc[('Michigan', 'Washtenaw County')]
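
* a minimal sketch of tuple-based lookups on a two-level index; the `toy` frame and its `POP` column are invented for illustration and are not real census figures

```python
import pandas as pd

# Toy data, invented for illustration; not real census figures
toy = pd.DataFrame(
    {
        "STNAME": ["Michigan", "Michigan", "Ohio"],
        "CTYNAME": ["Washtenaw County", "Wayne County", "Franklin County"],
        "POP": [100, 200, 300],
    }
).set_index(["STNAME", "CTYNAME"])

# A tuple names one row; the second argument to loc is then clearly a column
one_value = toy.loc[("Michigan", "Washtenaw County"), "POP"]

# A list of tuples selects several rows at once
two_rows = toy.loc[[("Michigan", "Washtenaw County"), ("Ohio", "Franklin County")]]

# A single outer key returns every row under it, indexed by the inner level
michigan = toy.loc["Michigan"]
michigan_counties = michigan.index.tolist()
```

* with the tuple form there is no guessing whether the second value is a row key or a column name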

# In class activity time!
## Question 0: Which county has the largest population in Michigan?
* (Easy)

## Question 1: Generate a new column holding each county's largest absolute change in population within the period 2010-2015.
* (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015; you need to consider all six columns.)