## Selection by Labels
* We can extract data from the rows using the location (`.loc`) attribute
* Watch carefully....

In [1]:
import pandas as pd # pandas is conventionally imported as pd
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])
df

Unnamed: 0,Name,Class,Score
U-M,Alice,Physics,85
MSU,Jack,Chemistry,82
U-M,Mark,Biology,90


In [3]:
df['Class'] # grabs a single column, which comes back as a Series

U-M      Physics
MSU    Chemistry
U-M      Biology
Name: Class, dtype: object

In [5]:
df.loc['MSU'] # grabs the row labelled 'MSU' (Jack's record from students) as a Series

Name          Jack
Class    Chemistry
Score           82
Name: MSU, dtype: object

In [6]:
type(df.loc['MSU']) #type is a series

pandas.core.series.Series

* Two important considerations:
1. The return value seems to be a `Series` -- neat!
2. `.loc` is **not** a function.  

In [None]:
df.loc('MSU') # raises an error: .loc is an indexer, so it takes square brackets (like indexing a NumPy array), not a function call

~~`loc()`~~ is not a thing; it's `loc[]`. Think about this as a NumPy array and it will make more sense: you're just indexing into the array.
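
To make the bracket-versus-parenthesis point concrete, here is a minimal sketch that rebuilds the same three-student frame. The exact exception raised by the call form may differ between pandas versions, so the sketch catches a generic `Exception` rather than assuming a specific type.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

# Square brackets index into the frame by label
row = df.loc['MSU']
print(row['Name'])

# Parentheses call the indexer instead of indexing it,
# and 'MSU' is not something that call understands
try:
    df.loc('MSU')
    call_failed = False
except Exception as exc:  # exact type assumed to vary by version
    call_failed = True
    print(type(exc).__name__)
```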

In [7]:
# reminder what our dataframe looks like
df

Unnamed: 0,Name,Class,Score
U-M,Alice,Physics,85
MSU,Jack,Chemistry,82
U-M,Mark,Biology,90


In [8]:
df.loc['U-M'] # returns a DataFrame with both rows whose index label is U-M; .loc selects by label

Unnamed: 0,Name,Class,Score
U-M,Alice,Physics,85
U-M,Mark,Biology,90


In [11]:
type(df.loc['U-M']) #type is dataframe

pandas.core.frame.DataFrame

In [10]:
df.loc['U-M', 'Score'] # selects the Score column for the U-M rows; a single column name gives back a Series

U-M    85
U-M    90
Name: Score, dtype: int64

In [12]:
# we can use loc to index in (you saw this)
#df.loc["MSU"]
# we can also add the second dimension, column names, to the index
df.loc['U-M', ['Score']] # returns a DataFrame because the column selector is a list

Unnamed: 0,Score
U-M,85
U-M,90


In [13]:
# what if we want two columns?
df.loc['U-M', ['Class', 'Score']] # returns a DataFrame with the U-M rows and the Class and Score columns

Unnamed: 0,Class,Score
U-M,Physics,85
U-M,Biology,90
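
The cells above return a `Series` in some cases and a `DataFrame` in others. As a rough summary, the sketch below rebuilds the same frame and prints the type each selector gives back. The variable names are just for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

single_row = df.loc['MSU']            # label appears once -> Series
dup_rows = df.loc['U-M']              # label appears twice -> DataFrame
one_col = df.loc['U-M', 'Score']      # repeated label, one column name -> Series
col_list = df.loc['U-M', ['Score']]   # column selector is a list -> DataFrame
cell = df.loc['MSU', 'Score']         # unique label, one column name -> single value

for name, obj in [('single_row', single_row), ('dup_rows', dup_rows),
                  ('one_col', one_col), ('col_list', col_list), ('cell', cell)]:
    print(name, type(obj).__name__)
```

The pattern to remember: a list selector keeps that dimension, while a single label drops it (unless the label is repeated in the index).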


## Other ways of slicing: by Index
* So, `.loc` allows us to index in both dimensions of the dataframe, and allows us to slice by both index label and column name.
* `.loc` has a sibling though, `.iloc`, which stands for integer location. It selects rows and columns by their zero-based position rather than by label.

In [14]:
df

Unnamed: 0,Name,Class,Score
U-M,Alice,Physics,85
MSU,Jack,Chemistry,82
U-M,Mark,Biology,90


In [15]:
df.iloc[0,1] # row 0, column 1 (positions are zero-based)

'Physics'

In [16]:
# Oh, and slicing? Check ✅
df.iloc[0:2, 0:2] # rows 0 and 1, columns 0 and 1; the end of an iloc slice is excluded

Unnamed: 0,Name,Class
U-M,Alice,Physics
MSU,Jack,Chemistry


In [17]:
df.iloc[0:3, 0:3] #would return the whole thing

Unnamed: 0,Name,Class,Score
U-M,Alice,Physics,85
MSU,Jack,Chemistry,82
U-M,Mark,Biology,90


* Great, we have a DataFrame. A two dimensional data storage object with row indexes and column names.
* We can get data out a row or column at a time, or narrow down to specific row/column combinations.
* And we can pull data out using nice labels (strings!) or integer locations.
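
One difference between the two styles is worth seeing side by side: a label slice with `.loc` includes both endpoints, while a position slice with `.iloc` excludes the stop, like ordinary Python slicing. The sketch below uses a hypothetical frame (`toy`) with unique labels `a`, `b`, `c`, because label slicing on the repeated `U-M` index above would not be well defined.

```python
import pandas as pd

# Hypothetical frame with unique row labels, for illustration only
toy = pd.DataFrame({'Score': [85, 82, 90]}, index=['a', 'b', 'c'])

by_label = toy.loc['a':'b']     # label slice: 'a' and 'b' are both included
by_position = toy.iloc[0:1]     # position slice: stop position 1 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```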

### An Aside (and a warning)...

I'm going to show you something I would encourage you to never use.

I mean, it looks nice, but it's really going to bite you later....

In [18]:
df.Name.MSU # dot notation: goes to the Name column, then the MSU label, and finds Jack

'Jack'

In [19]:
df

Unnamed: 0,Name,Class,Score
U-M,Alice,Physics,85
MSU,Jack,Chemistry,82
U-M,Mark,Biology,90


In [20]:
df['advanced'] = [True, False, True] # creates a new column
df

Unnamed: 0,Name,Class,Score,advanced
U-M,Alice,Physics,85,True
MSU,Jack,Chemistry,82,False
U-M,Mark,Biology,90,True


In [21]:
df.advanced # dot notation reads the new column back as a Series of True, False, True

U-M     True
MSU    False
U-M     True
Name: advanced, dtype: bool

In [22]:
df['advanced two'] = [1, 2, 3] #creating another column
df

Unnamed: 0,Name,Class,Score,advanced,advanced two
U-M,Alice,Physics,85,True,1
MSU,Jack,Chemistry,82,False,2
U-M,Mark,Biology,90,True,3


In [None]:
df.advanced two # SyntaxError: dot notation can't reach a column whose name contains a space, so it limits how you can name columns

In [23]:
df.advanced_three = [4, 5, 6] # tries to create a column using dot notation
# pandas emits a UserWarning here: you can't create a column by assigning to a new attribute name



In [24]:
df.advanced_three # looks like something was stored, but this is just a plain Python attribute; print the dataframe and the column isn't there

[4, 5, 6]

In [25]:
df #advanced_three not there

Unnamed: 0,Name,Class,Score,advanced,advanced two
U-M,Alice,Physics,85,True,1
MSU,Jack,Chemistry,82,False,2
U-M,Mark,Biology,90,True,3


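The pandas developers expose each column name as an attribute on the DataFrame, which is what lets dot notation read a column directly. Assigning to a new attribute name, however, only sets a Python attribute and does not create a column.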

See https://www.dataschool.io/pandas-dot-notation-vs-brackets/

Please, just forget you saw this.
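
If you want to see why it bites, here is a small sketch on a hypothetical two-row frame. The column name `count` is chosen on purpose because it collides with the `DataFrame.count` method; `new_col` and `new col` are made-up names.

```python
import warnings
import pandas as pd

# Hypothetical frame; 'count' deliberately clashes with a DataFrame method
df = pd.DataFrame({'Name': ['Alice', 'Jack'], 'count': [1, 2]})

# Pitfall 1: the method wins over the column under dot notation
print(callable(df.count))       # this is the method, not the column
print(df['count'].tolist())     # brackets always reach the column

# Pitfall 2: attribute assignment does not create a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the UserWarning pandas emits
    df.new_col = [4, 5]
print('new_col' in df.columns)

# Brackets do create the column, even with a space in the name
df['new col'] = [4, 5]
print(df.columns.tolist())
```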

### Indexing by Callable

In [26]:
import pandas as pd
# We'll load in our CSV file
df = pd.read_csv('datasets/Admission_Predict.csv', index_col=0)
# And we'll clean up a couple of poorly named columns like before
df.columns = [x.lower().strip() for x in df.columns]  #turns all column titles to lower case and strips white space
# And we'll take a look at the results
df.head() #gets first few rows

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


* Querying dataframes is all about boolean masking

In [27]:
admit_mask = df['chance of admit'] > 0.7  #looks at chance of admit and returns True if greater than .7 for df
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

* We can apply a mask in a couple of ways

In [28]:
df[admit_mask].head() #grabs everyone greater than .7 and returns first few rows

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9


* Of course, you don't have to make the mask object (and likely won't)!

In [30]:
df[df['chance of admit'] > 0.7 ].head() #exactly the same as df[mask], you don't have to create a mask

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9


* We can also use the `where()` method. A subtle issue is that rows failing the condition are kept and filled with `NaN` rather than dropped.

In [31]:
df.where(admit_mask).head() # where() keeps values where the condition is True and replaces the rest with NaN (row 5 here)

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


* The nice thing about `where()` is that it's easy to read
* Often you mix it together with `dropna()`

In [32]:
df.where(admit_mask).dropna().head() # where the mask is False the row is replaced with NaNs, dropna() then drops those rows, and head() shows the first few that remain

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9
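
Notice that the `where().dropna()` output shows `337.0` where plain masking showed `337`. The sketch below reproduces that on a three-row frame built from rows 1, 5 and 6 above (two columns only), so you can compare the dtypes directly. Treat it as an illustration rather than a rule for every pandas version.

```python
import pandas as pd

# Rows 1, 5 and 6 of the admissions data, two columns only
toy = pd.DataFrame({'gre score': [337, 314, 330],
                    'chance of admit': [0.92, 0.65, 0.90]},
                   index=[1, 5, 6])
mask = toy['chance of admit'] > 0.7

masked = toy[mask]                       # rows are filtered directly
where_dropped = toy.where(mask).dropna() # rows become NaN first, then get dropped

print(masked['gre score'].dtype)
print(where_dropped['gre score'].dtype)
print(masked.index.tolist(), where_dropped.index.tolist())
```

Both approaches keep the same rows here, but the integer column passes through `NaN` on the `where()` route and comes back as floating point. Also keep in mind that `dropna()` removes any row containing a missing value, including rows that had missing data before the mask was applied.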


* Masks can be composites, and made up of several conditions

In [33]:
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9) # raises ValueError: Python's "and" needs a single True/False, and a whole Series doesn't have one

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

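* The problem is, Python's `and` asks each `Series` for a single truth value, and a `Series` holding many booleans doesn't have one, so pandas can't `and` two `Series` objects together.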
* PEP 335: https://www.python.org/dev/peps/pep-0335/
* But it does know how to `&` them!
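
A tiny sketch of what is going on, using a hypothetical three-value `Series` (`s`) in place of the admissions column: asking for the truth value of a multi-element `Series` raises `ValueError`, while `&` compares element by element.

```python
import pandas as pd

# Hypothetical values standing in for the 'chance of admit' column
s = pd.Series([0.92, 0.76, 0.65])

# `and` would call bool() on the left-hand Series, which is ambiguous
try:
    bool(s > 0.7)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# `&` works element by element and returns a boolean Series
elementwise = (s > 0.7) & (s < 0.9)
print(elementwise.tolist())
```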

In [35]:
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9) # & works

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

* But, you need to watch out for order of operations!

In [36]:
df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9 # missing parentheses: & binds tighter than > and <, so Python evaluates 0.7 & df['chance of admit'] first and fails

TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
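
In Python, `&`, `|` and `~` bind more tightly than the comparison operators, so each comparison needs its own parentheses. Here is a short sketch on a hypothetical three-value `Series`. The exact error message for the unparenthesized form depends on the pandas version, so the sketch catches a generic `Exception`.

```python
import pandas as pd

# Hypothetical values standing in for the 'chance of admit' column
s = pd.Series([0.92, 0.76, 0.65])

# Without parentheses Python tries 0.7 & s first, which is not a valid operation
try:
    s > 0.7 & s < 0.9
    failed = False
except Exception as exc:  # exact type and message assumed to vary by version
    failed = True
    print(type(exc).__name__)

# With parentheses, masks combine with & (and), | (or) and ~ (not)
inside = (s > 0.7) & (s < 0.9)
outside = (s <= 0.7) | (s >= 0.9)
negated = ~inside

print(inside.tolist())
print(outside.tolist())
print(negated.tolist())
```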

* Finally, there are additional helper functions on dataframes to be aware of

In [None]:
df.where(df['chance of admit'].gt(0.7)).dropna().head() # gt = greater than, lt = less than; rows where chance of admit is not above 0.7 become NaN, dropna() removes them, and head() shows the first few rows
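
The comparison methods return the same boolean `Series` as the operators, and because they are method calls they sidestep the parentheses problem. A minimal sketch on a hypothetical three-value `Series`:

```python
import pandas as pd

# Hypothetical values standing in for the 'chance of admit' column
s = pd.Series([0.92, 0.76, 0.65], name='chance of admit')

with_methods = s.gt(0.7) & s.lt(0.9)        # no extra parentheses needed
with_operators = (s > 0.7) & (s < 0.9)      # operator form for comparison

print(with_methods.tolist())
print(with_methods.equals(with_operators))
```

The siblings `ge`, `le`, `eq` and `ne` follow the same pattern.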

## More Indexing
* Let's go back to indexing dataframes; there's some neat stuff there
* Remember that the index holds the row-level labels, and the column names are the column-level labels
* We can swap columns and rows trivially

In [37]:
df.T #T means transpose, rows are now columns and columns are now rows

Serial No.,1,2,3,4,5,6,7,8,9,10,...,391,392,393,394,395,396,397,398,399,400
gre score,337.0,324.0,316.0,322.0,314.0,330.0,321.0,308.0,302.0,323.0,...,314.0,318.0,326.0,317.0,329.0,324.0,325.0,330.0,312.0,333.0
toefl score,118.0,107.0,104.0,110.0,103.0,115.0,109.0,101.0,102.0,108.0,...,102.0,106.0,112.0,104.0,111.0,110.0,107.0,116.0,103.0,117.0
university rating,4.0,4.0,3.0,3.0,2.0,5.0,3.0,2.0,1.0,3.0,...,2.0,3.0,4.0,2.0,4.0,3.0,3.0,4.0,3.0,4.0
sop,4.5,4.0,3.0,3.5,2.0,4.5,3.0,3.0,2.0,3.5,...,2.0,2.0,4.0,3.0,4.5,3.5,3.0,5.0,3.5,5.0
lor,4.5,4.5,3.5,2.5,3.0,3.0,4.0,4.0,1.5,3.0,...,2.5,3.0,3.5,3.0,4.0,3.5,3.5,4.5,4.0,4.0
cgpa,9.65,8.87,8.0,8.67,8.21,9.34,8.2,7.9,8.0,8.6,...,8.24,8.65,9.12,8.76,9.23,9.04,9.11,9.45,8.78,9.66
research,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0
chance of admit,0.92,0.76,0.72,0.8,0.65,0.9,0.75,0.68,0.5,0.45,...,0.64,0.71,0.84,0.77,0.89,0.82,0.84,0.91,0.67,0.95


* We saw that we can set the index with `set_index()`

In [39]:
df.set_index('lor').head() #set index to lor 

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,cgpa,research,chance of admit
lor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
4.5,337,118,4,4.5,9.65,1,0.92
4.5,324,107,4,4.0,8.87,1,0.76
3.5,316,104,3,3.0,8.0,1,0.72
2.5,322,110,3,3.5,8.67,1,0.8
3.0,314,103,2,2.0,8.21,0,0.65


In [41]:
# Of course, this didn't actually change our previous dataframe, right?
display(df.head()) # the index is still Serial No.
ndf = df.where(df['sop'] > 4.1) # ndf keeps rows where sop > 4.1 and fills every other row with NaN
ndf.head() # row 1 (sop 4.5) survives; rows 2-5 have sop <= 4.1, so they are all NaN

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,,,,,,,,
3,,,,,,,,
4,,,,,,,,
5,,,,,,,,
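
To confirm that `set_index()` hands back a new frame and leaves the original alone, here is a small sketch using the `lor` and `sop` values from the first three rows above. The names `toy`, `reindexed` and `restored` are just for illustration.

```python
import pandas as pd

# First three rows of the admissions data, two columns only
toy = pd.DataFrame({'lor': [4.5, 4.5, 3.5], 'sop': [4.5, 4.0, 3.0]},
                   index=[1, 2, 3])
toy.index.name = 'Serial No.'

reindexed = toy.set_index('lor')   # returns a new frame
print(toy.index.name)              # original index is untouched
print(reindexed.index.name)

# To keep the change, assign the result; reset_index() moves the index back to a column
restored = toy.reset_index()
print(restored.columns.tolist())
```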


## Multilevel indexing
* We can create hierarchical indices, which is pretty neat
* Let's look at some (old) census data

In [4]:
import pandas as pd #import pandas 
df=pd.read_csv("datasets/census.csv") #reads file
df.head() #prints first few rows

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [5]:
# In this data there are only two sumlevels
# ...so lets just get county level data
df['SUMLEV'].unique() # lists the distinct SUMLEV values; only 40 and 50 appear in this data

array([40, 50])

In [6]:
df[df['SUMLEV'] == 40] # grabs all the rows where SUMLEV is 40; these are the state-level aggregates

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
68,40,4,9,2,0,Alaska,Alaska,710231,710249,714021,...,-1.173489,-1.946424,-3.915107,-14.43891,-10.407475,0.931274,1.818497,-0.757148,-11.271709,-6.881838
98,40,4,8,4,0,Arizona,Arizona,6392017,6392307,6408208,...,1.327489,5.24574,3.905473,6.219955,6.776501,3.290378,7.337279,6.123606,8.761352,9.335208
114,40,3,7,5,0,Arkansas,Arkansas,2915918,2915958,2922394,...,1.365312,-0.432402,-0.442153,-1.060966,-0.407735,2.326251,0.646395,0.682189,0.226168,0.901928
190,40,4,9,6,0,California,California,37253956,37254503,37334079,...,-1.148464,-1.163788,-1.339869,-0.862856,-1.981572,2.761704,2.647127,2.728645,3.743342,2.656065
249,40,4,8,8,0,Colorado,Colorado,5029196,5029324,5048254,...,5.232828,5.513416,6.903846,7.375183,10.073656,6.968908,7.698805,9.053163,9.764255,12.537918
314,40,1,1,9,0,Connecticut,Connecticut,3574097,3574118,3579717,...,-3.7244,-5.311208,-4.730271,-7.567093,-7.687268,0.77551,-1.066084,-0.278693,-2.376831,-2.463243
323,40,3,5,10,0,Delaware,Delaware,897934,897936,899791,...,2.867721,3.594491,3.169689,5.106051,4.490138,5.332723,6.360496,5.939911,8.245757,7.628452
327,40,3,5,11,0,District of Columbia,District of Columbia,601723,601767,605126,...,11.560071,10.052444,9.457678,1.480094,5.601833,17.028422,15.972111,15.63412,8.378037,12.434838
329,40,3,5,12,0,Florida,Florida,18801310,18804623,18849890,...,5.556307,5.138184,4.84742,6.958576,10.080932,11.364173,10.67603,10.565882,13.47857,16.528676


In [7]:
df = df[df['SUMLEV'] == 50] #grabbing all rows where SUMLEV is 50, gets all the county data
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [8]:
# We can set a multilevel index just by passing a list of things we want to index on
df = df.set_index(['STNAME', 'CTYNAME']) # creates a multilevel index: state first, then county
df.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411
Alabama,Bullock County,50,3,6,1,11,10914,10915,10887,10629,10606,...,-30.953709,-5.180127,-1.130263,14.35429,-16.167247,-29.001673,-2.825524,1.507017,17.24379,-13.193961
Alabama,Butler County,50,3,6,1,13,20947,20946,20944,20673,20408,...,-14.032727,-11.684234,-5.655413,1.085428,-6.529805,-13.936612,-11.586865,-5.557058,1.184103,-6.430868
Alabama,Calhoun County,50,3,6,1,15,118572,118586,118437,117768,117286,...,-6.15567,-4.611706,-5.524649,-4.463211,-3.376322,-5.791579,-4.092677,-5.062836,-3.912834,-2.806406
Alabama,Chambers County,50,3,6,1,17,34215,34170,34098,33993,34075,...,-2.731639,3.849092,2.872721,-2.287222,1.349468,-1.821092,4.701181,3.781439,-1.290228,2.346901
Alabama,Cherokee County,50,3,6,1,19,25989,25986,25976,26080,26023,...,6.339327,1.11318,5.488706,-0.076806,-3.239866,6.416167,1.420264,5.757384,0.230419,-2.931307


* Querying gets, frankly, complex
* `df.loc[row, column]`
* But with a multiindex we can do
  * `df.loc[row index1, row index2]`

In [9]:
df.loc['Missouri', 'St. Louis County'] # grabs one county's row: 'Missouri' is row index level 1 and 'St. Louis County' is row index level 2

SUMLEV          50.000000
REGION           2.000000
DIVISION         4.000000
STATE           29.000000
COUNTY         189.000000
                  ...    
RNETMIG2011     -1.813988
RNETMIG2012     -0.777081
RNETMIG2013     -1.395556
RNETMIG2014     -0.731726
RNETMIG2015     -0.935404
Name: (Missouri, St. Louis County), Length: 98, dtype: float64

In [10]:
df.loc['Missouri', 'St. Louis County']['REGION'] # returns the REGION value for that county: the two row index levels select the row, then ['REGION'] picks the column

2.0

In [11]:
df.loc['Missouri', 'REGION'] # gets the region of every county in the state; this time 'Missouri' is the row key and 'REGION' is read as a column

CTYNAME
Adair County       2
Andrew County      2
Atchison County    2
Audrain County     2
Barry County       2
                  ..
Wayne County       2
Webster County     2
Worth County       2
Wright County      2
St. Louis city     2
Name: REGION, Length: 115, dtype: int64

In [52]:
df.loc['Missouri', 'REGION']['St. Louis County'] # 'Missouri' is the row key and 'REGION' the column, then we index the resulting Series by county

2

In [54]:
# It's a bit ambiguous whether the second key is a row level or a column; I recommend writing the row keys as one tuple
df.loc[('Michigan', 'Washtenaw County')] #returns data for county in state

SUMLEV          50.000000
REGION           2.000000
DIVISION         3.000000
STATE           26.000000
COUNTY         161.000000
                  ...    
RNETMIG2011      5.191395
RNETMIG2012      1.248106
RNETMIG2013      4.226778
RNETMIG2014      3.801394
RNETMIG2015      0.595048
Name: (Michigan, Washtenaw County), Length: 98, dtype: float64
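
One way to keep rows and columns clearly apart is to pass the row key as a tuple and always give a column selector after it, even if it is just `:`. The sketch below uses a hypothetical three-county frame: `REGION` matches the values shown above, but the `POP` numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the census frame; POP values are invented
toy = pd.DataFrame({
    'STNAME': ['Michigan', 'Michigan', 'Missouri'],
    'CTYNAME': ['Washtenaw County', 'Wayne County', 'St. Louis County'],
    'REGION': [2, 2, 2],
    'POP': [10, 30, 20],
}).set_index(['STNAME', 'CTYNAME'])

# Row key as a tuple, then an explicit column selector
one_row = toy.loc[('Michigan', 'Washtenaw County'), :]
one_cell = toy.loc[('Michigan', 'Washtenaw County'), 'POP']

# A list of tuples selects several rows at once
two_rows = toy.loc[[('Michigan', 'Wayne County'),
                    ('Missouri', 'St. Louis County')], :]

# A single first-level key returns every county in that state
state = toy.loc['Michigan']

print(one_cell)
print(two_rows.index.tolist())
print(state.index.tolist())
```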

#### Which county has the largest population in Michigan?

In [58]:
df.loc[('Michigan')].sort_values('CENSUS2010POP', ascending = False).head(1) # takes the Michigan rows, sorts them by CENSUS2010POP from largest to smallest (ascending is False), then uses head(1) to return the top row

Unnamed: 0_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
CTYNAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Wayne County,50,2,3,26,163,1820584,1820641,1815199,1801273,1792514,...,-13.340073,-10.271616,-14.119617,-11.903253,-8.762835,-11.344758,-8.098421,-11.732437,-9.161648,-6.010195
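
Sorting works, but if you only need the top row there are a couple of shorter routes. The sketch below shows `idxmax()` and `nlargest()` on a hypothetical two-county frame. The `POP` column and its values are made up and stand in for `CENSUS2010POP`.

```python
import pandas as pd

# Hypothetical miniature of the census frame; POP values are invented
toy = pd.DataFrame({
    'STNAME': ['Michigan', 'Michigan'],
    'CTYNAME': ['Washtenaw County', 'Wayne County'],
    'POP': [10, 30],
}).set_index(['STNAME', 'CTYNAME'])

michigan = toy.loc['Michigan']                # county-level frame for one state

largest_name = michigan['POP'].idxmax()       # index label of the largest value
largest_row = michigan.nlargest(1, 'POP')     # the full row, as a DataFrame

print(largest_name)
print(largest_row)
```

`idxmax()` is handy when you only want the county name, and `nlargest()` when you want the whole row without sorting everything.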
