---
# Selecting Subsets of Data
---

## Selecting Series data
Series and DataFrames allow selection by position (like Python lists) and by label (like Python
dictionaries). Indexing through the `.iloc` attribute selects only by integer position, much like
indexing a Python list. Indexing through the `.loc` attribute selects only by index label, much
like looking up a key in a Python dictionary.
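
The distinction matters most when the index labels are themselves integers. The following is a minimal sketch on a made-up Series (the values and labels are hypothetical, not from the college dataset) whose integer labels do not match the positions:

```python
import pandas as pd

# Hypothetical toy Series: the integer labels are deliberately out of order,
# so label 2 and position 2 refer to different elements.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[2]      # label 2 is attached to the first element
by_position = s.iloc[2]  # position 2 is the third element

print(by_label)     # a
print(by_position)  # c
```

Using `.loc` or `.iloc` explicitly avoids having to guess which of the two interpretations applies.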

In [1]:
import numpy as np
import pandas as pd

In [2]:
college = pd.read_csv('./college.csv', index_col='INSTNM')
college.columns

Index(['CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

In [3]:
city = college['CITY']
city

INSTNM
Alabama A & M University                                            Normal
University of Alabama at Birmingham                             Birmingham
Amridge University                                              Montgomery
University of Alabama in Huntsville                             Huntsville
Alabama State University                                        Montgomery
                                                                ...       
SAE Institute of Technology  San Francisco                      Emeryville
Rasmussen College - Overland Park                            Overland Park
National Personal Training Institute of Cleveland         Highland Heights
Bay Area Medical Academy - San Jose Satellite Location            San Jose
Excel Learning Center-San Antonio South                        San Antonio
Name: CITY, Length: 7535, dtype: object

Pull out a scalar value by indexing the Series directly with a label:

In [4]:
city['Amridge University']

'Montgomery'

Pull out the same scalar value by label using the `.loc` attribute:

In [5]:
city.loc['Amridge University']

'Montgomery'

Pull out a scalar value using the `.iloc` attribute by position:

In [6]:
city.iloc[2]

'Montgomery'

Pull out several values by indexing with a list. Note that when we pass a list to the index
operation, pandas returns a Series instead of a scalar:

In [10]:
city[['Amridge University', 'University of Alabama in Huntsville', 'Rasmussen College - Overland Park']]

INSTNM
Amridge University                        Montgomery
University of Alabama in Huntsville       Huntsville
Rasmussen College - Overland Park      Overland Park
Name: CITY, dtype: object

Repeat the above using `.loc`

In [11]:
city.loc[['Amridge University', 'University of Alabama in Huntsville', 'Rasmussen College - Overland Park']]

INSTNM
Amridge University                        Montgomery
University of Alabama in Huntsville       Huntsville
Rasmussen College - Overland Park      Overland Park
Name: CITY, dtype: object

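Select several values by position using `.iloc`. Note that positions 2 and 4 pick a different
set of institutions than the labels used above: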

In [12]:
city.iloc[[2, 4]]

INSTNM
Amridge University          Montgomery
Alabama State University    Montgomery
Name: CITY, dtype: object
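
The return type depends on the form of the indexer, not on how many values match. As a small sketch (rebuilding the first three rows of `city` by hand, so it runs without the CSV file), a scalar label yields a scalar while a one-element list yields a one-row Series:

```python
import pandas as pd

# First three rows of the CITY Series shown above, rebuilt by hand
# so the example does not need college.csv.
city = pd.Series(
    ["Normal", "Birmingham", "Montgomery"],
    index=[
        "Alabama A & M University",
        "University of Alabama at Birmingham",
        "Amridge University",
    ],
    name="CITY",
)

scalar = city.loc["Amridge University"]     # scalar label -> scalar value
one_row = city.loc[["Amridge University"]]  # list of labels -> Series

print(type(scalar).__name__)   # str
print(type(one_row).__name__)  # Series
```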

Use a slice of labels to pull out many values. Note that the stop label is included in the result:

In [14]:
city['Alabama State University': 'Auburn University at Montgomery']

INSTNM
Alabama State University                 Montgomery
The University of Alabama                Tuscaloosa
Central Alabama Community College    Alexander City
Athens State University                      Athens
Auburn University at Montgomery          Montgomery
Name: CITY, dtype: object

Use a slice to pull out many values by position:

In [15]:
city.iloc[0: 4]

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object
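
The two kinds of slice treat their endpoints differently: a label slice includes the stop label, while a position slice excludes the stop position, as with Python lists. A minimal sketch on a hypothetical five-element Series:

```python
import pandas as pd

# Hypothetical toy Series with string labels 'a' through 'e'.
s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

by_label = s.loc["b":"d"]  # label slice: both 'b' and 'd' are included
by_position = s.iloc[1:3]  # position slice: positions 1 and 2 only

print(by_label.index.tolist())     # ['b', 'c', 'd']
print(by_position.index.tolist())  # ['b', 'c']
```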

Use a Boolean array to pull out certain values

In [17]:
alabama_mask = city.isin(["Birmingham", "Montgomery"])
city[alabama_mask]

INSTNM
University of Alabama at Birmingham                 Birmingham
Amridge University                                  Montgomery
Alabama State University                            Montgomery
Auburn University at Montgomery                     Montgomery
Birmingham Southern College                         Birmingham
South University-Montgomery                         Montgomery
Faulkner University                                 Montgomery
Herzing University-Birmingham                       Birmingham
Huntingdon College                                  Montgomery
Jefferson State Community College                   Birmingham
Lawson State Community College-Birmingham Campus    Birmingham
Samford University                                  Birmingham
Southeastern Bible College                          Birmingham
H Councill Trenholm State Community College         Montgomery
West Virginia University Institute of Technology    Montgomery
Virginia College-Birmingham                     
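
A Boolean mask can be passed to plain indexing or to `.loc`. As a sketch under the assumption that `.iloc` is given the mask as a plain NumPy array rather than as a Series, all three forms select the same rows (the four rows below are copied from the output at the top of this section):

```python
import pandas as pd

# First four rows of the CITY Series, rebuilt by hand so the example
# runs without college.csv.
city = pd.Series(
    ["Normal", "Birmingham", "Montgomery", "Huntsville"],
    index=[
        "Alabama A & M University",
        "University of Alabama at Birmingham",
        "Amridge University",
        "University of Alabama in Huntsville",
    ],
    name="CITY",
)

mask = city.isin(["Birmingham", "Montgomery"])  # Boolean Series, same index

via_brackets = city[mask]
via_loc = city.loc[mask]
via_iloc = city.iloc[mask.to_numpy()]  # convert the mask to a NumPy array first

print(via_loc.tolist())  # ['Birmingham', 'Montgomery']
```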

## Selecting DataFrame rows
Select rows from a DataFrame using the `.iloc` and `.loc` indexers.

To select the entire row at a given position, pass an integer to `.iloc`:

In [18]:
college.iloc[60]

CITY                  Anchorage
STABBR                       AK
HBCU                          0
MENONLY                       0
WOMENONLY                     0
RELAFFIL                      0
SATVRMID                    NaN
SATMTMID                    NaN
DISTANCEONLY                  0
UGDS                      12865
UGDS_WHITE               0.5747
UGDS_BLACK               0.0358
UGDS_HISP                0.0761
UGDS_ASIAN               0.0778
UGDS_AIAN                0.0653
UGDS_NHPI                0.0086
UGDS_2MOR                 0.098
UGDS_NRA                 0.0181
UGDS_UNKN                0.0457
PPTUG_EF                 0.4539
CURROPER                      1
PCTPELL                  0.2385
PCTFLOAN                 0.2647
UG25ABV                  0.4386
MD_EARN_WNE_P10           42500
GRAD_DEBT_MDN_SUPP      19449.5
Name: University of Alaska Anchorage, dtype: object

To get the same row as the preceding step, pass the index label to `.loc`

In [19]:
college.loc["University of Alaska Anchorage"]

CITY                  Anchorage
STABBR                       AK
HBCU                          0
MENONLY                       0
WOMENONLY                     0
RELAFFIL                      0
SATVRMID                    NaN
SATMTMID                    NaN
DISTANCEONLY                  0
UGDS                      12865
UGDS_WHITE               0.5747
UGDS_BLACK               0.0358
UGDS_HISP                0.0761
UGDS_ASIAN               0.0778
UGDS_AIAN                0.0653
UGDS_NHPI                0.0086
UGDS_2MOR                 0.098
UGDS_NRA                 0.0181
UGDS_UNKN                0.0457
PPTUG_EF                 0.4539
CURROPER                      1
PCTPELL                  0.2385
PCTFLOAN                 0.2647
UG25ABV                  0.4386
MD_EARN_WNE_P10           42500
GRAD_DEBT_MDN_SUPP      19449.5
Name: University of Alaska Anchorage, dtype: object
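
As the two outputs above suggest, a single selected row comes back as a Series: the column names become its index and the row label becomes its name. A small sketch on a hand-built two-row, three-column excerpt of the data:

```python
import pandas as pd

# Two rows and three columns copied from the college data,
# rebuilt by hand so the example runs without college.csv.
df = pd.DataFrame(
    {"CITY": ["Normal", "Birmingham"], "STABBR": ["AL", "AL"], "HBCU": [1, 0]},
    index=pd.Index(
        ["Alabama A & M University", "University of Alabama at Birmingham"],
        name="INSTNM",
    ),
)

row = df.iloc[1]

print(type(row).__name__)  # Series
print(row.name)            # University of Alabama at Birmingham
print(row.index.tolist())  # ['CITY', 'STABBR', 'HBCU']
```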

To select a disjointed set of rows as a DataFrame, pass a list of integers to `.iloc`

In [20]:
college.iloc[[60, 99, 3]]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
University of Alaska Anchorage,Anchorage,AK,0.0,0.0,0.0,0,,,0.0,12865.0,0.5747,0.0358,0.0761,0.0778,0.0653,0.0086,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500,19449.5
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,0.2713,0.25,0.367,0.016,0.016,0.0,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0


The same DataFrame from the preceding step may be reproduced with `.loc` by passing it a list of
the institution names:

In [21]:
labels = ["University of Alaska Anchorage", 
          "International Academy of Hair Design", 
          "University of Alabama in Huntsville",
          ]
college.loc[labels]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
University of Alaska Anchorage,Anchorage,AK,0.0,0.0,0.0,0,,,0.0,12865.0,0.5747,0.0358,0.0761,0.0778,0.0653,0.0086,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500,19449.5
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,0.2713,0.25,0.367,0.016,0.016,0.0,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
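
In both cases the rows come back in the order given in the list, not in their original order. A minimal sketch on a hypothetical four-row DataFrame, checking that the positional and label-based selections agree:

```python
import pandas as pd

# Hypothetical toy DataFrame with string labels 'a' through 'd'.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=list("abcd"))

by_position = df.iloc[[3, 0]]  # rows at positions 3 and 0, in that order
by_label = df.loc[["d", "a"]]  # the same rows requested by label

print(by_position.index.tolist())    # ['d', 'a']
print(by_position.equals(by_label))  # True
```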


Use slice notation with `.iloc` to select contiguous rows of the data:

In [22]:
college.iloc[99:102]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,0.2713,0.25,0.367,0.016,0.016,0.0,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556
GateWay Community College,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,5211.0,0.3585,0.1201,0.3389,0.0355,0.0451,0.0029,0.0127,0.0161,0.0702,0.7465,1,0.327,0.2189,0.5832,29800,7283
Mesa Community College,Mesa,AZ,0.0,0.0,0.0,0,,,0.0,19055.0,0.5002,0.0661,0.2354,0.039,0.0403,0.0046,0.0205,0.0257,0.0682,0.6457,1,0.3423,0.2207,0.401,35200,8000


Slice notation also works with `.loc` and is a closed interval (it includes both the start
label and the stop label)

In [23]:
start = "International Academy of Hair Design"
stop = "Mesa Community College"
college.loc[start:stop]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
International Academy of Hair Design,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,188.0,0.2713,0.25,0.367,0.016,0.016,0.0,0.016,0.0,0.0638,0.0,0,0.7185,0.7346,0.3905,22200,10556
GateWay Community College,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,5211.0,0.3585,0.1201,0.3389,0.0355,0.0451,0.0029,0.0127,0.0161,0.0702,0.7465,1,0.327,0.2189,0.5832,29800,7283
Mesa Community College,Mesa,AZ,0.0,0.0,0.0,0,,,0.0,19055.0,0.5002,0.0661,0.2354,0.039,0.0403,0.0046,0.0205,0.0257,0.0682,0.6457,1,0.3423,0.2207,0.401,35200,8000


To look up the index labels found at given positions, chain `.index` onto the `.iloc` result:

In [24]:
college.iloc[[60, 30, -1]].index.to_list()

['University of Alaska Anchorage',
 'Judson College',
 'Excel Learning Center-San Antonio South']
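
The same translation can be done on the index object itself, and in the other direction. The sketch below uses a hypothetical four-row DataFrame and assumes the index labels are unique, in which case `Index.get_loc` returns a single integer position:

```python
import pandas as pd

# Hypothetical toy DataFrame with unique string labels.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=list("abcd"))

# Positions -> labels: index the Index object directly.
labels = df.index[[2, 0, -1]].tolist()

# Label -> position: get_loc returns an int when the label is unique.
position = df.index.get_loc("c")

print(labels)    # ['c', 'a', 'd']
print(position)  # 2
```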

## Selecting DataFrame rows and columns simultaneously
The generic form to select rows and columns:  

`df.iloc[row_idxs, column_idxs]`  
`df.loc[row_names, column_names]`  

Here `row_idxs` and `column_idxs` can be scalar integers, lists of integers, or integer
slices. Likewise, `row_names` and `column_names` can be scalar names, lists of names,
or slices of names; `row_names` can also be a Boolean array.
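
As an illustration of the Boolean-array case, the sketch below filters rows with a condition and picks two columns by name in one `.loc` call. The three rows are copied from the college data; the threshold of 1000 and the name `large` are arbitrary choices for this example:

```python
import pandas as pd

# Three rows and three columns copied from the college data,
# rebuilt by hand so the example runs without college.csv.
df = pd.DataFrame(
    {
        "CITY": ["Anchorage", "Tempe", "Huntsville"],
        "STABBR": ["AK", "AZ", "AL"],
        "UGDS": [12865, 188, 5451],
    },
    index=pd.Index(
        [
            "University of Alaska Anchorage",
            "International Academy of Hair Design",
            "University of Alabama in Huntsville",
        ],
        name="INSTNM",
    ),
)

large = df["UGDS"] > 1000  # Boolean Series used as the row selector
subset = df.loc[large, ["CITY", "UGDS"]]

print(subset.index.tolist())
# ['University of Alaska Anchorage', 'University of Alabama in Huntsville']
```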

Select the first three rows and the first four columns with slice notation

In [26]:
college.iloc[:3, :4]

INSTNM,CITY,STABBR,HBCU,MENONLY
Alabama A & M University,Normal,AL,1.0,0.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0
Amridge University,Montgomery,AL,0.0,0.0


Select all the rows of two different columns:

In [27]:
college.loc[:, ['CITY', 'MENONLY']]

INSTNM,CITY,MENONLY
Alabama A & M University,Normal,0.0
University of Alabama at Birmingham,Birmingham,0.0
Amridge University,Montgomery,0.0
University of Alabama in Huntsville,Huntsville,0.0
Alabama State University,Montgomery,0.0
...,...,...
SAE Institute of Technology San Francisco,Emeryville,
Rasmussen College - Overland Park,Overland Park,
National Personal Training Institute of Cleveland,Highland Heights,
Bay Area Medical Academy - San Jose Satellite Location,San Jose,


Select disjointed rows and columns

In [28]:
college.iloc[[1, -1], [0, 4]]

INSTNM,CITY,WOMENONLY
University of Alabama at Birmingham,Birmingham,0.0
Excel Learning Center-San Antonio South,San Antonio,


Select a single scalar value

In [29]:
college.iloc[1, 0]

'Birmingham'

In [30]:
college.loc['University of Alabama at Birmingham', 'CITY']

'Birmingham'

Slice the rows and select a single column:

In [31]:
college.iloc[:2, -1]

INSTNM
Alabama A & M University                 33888
University of Alabama at Birmingham    21941.5
Name: GRAD_DEBT_MDN_SUPP, dtype: object

In [32]:
college.loc[:'University of Alabama at Birmingham', 'GRAD_DEBT_MDN_SUPP']

INSTNM
Alabama A & M University                 33888
University of Alabama at Birmingham    21941.5
Name: GRAD_DEBT_MDN_SUPP, dtype: object
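
Because the column selector above is a scalar, the result is a Series. Wrapping the same selector in a list keeps the result two-dimensional. A minimal sketch on a hypothetical three-row DataFrame (the column names `x` and `y` are made up):

```python
import pandas as pd

# Hypothetical toy DataFrame with two columns.
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]},
    index=list("abc"),
)

as_series = df.iloc[:2, -1]   # scalar column position -> Series
as_frame = df.iloc[:2, [-1]]  # list of column positions -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```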