# ch04 Selecting Subsets of Data

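- Each dimension of a Series or DataFrame is labeled by an Index object.
- This index is what distinguishes pandas data structures from NumPy's n-dimensional arrays.
- The index gives meaningful labels to the rows and columns of the data, and pandas users can select data by these labels.
- pandas also allows data to be selected by the integer position of rows and columns.
- This dual capability is powerful, but it can make the syntax for selecting subsets of data somewhat confusing.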

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

## 1. Selecting Series data

- Series and DataFrame are complex data containers: the indexing operator and several attributes each select data in a different way.
- Besides the indexing operator, the .iloc and .loc attributes are available.
- Together these are called indexers.
- s[item]: indexing operator
- s.iloc[item]: .iloc indexer
- s.loc[item]: .loc indexer

- The .iloc indexer selects by integer position only and behaves much like a Python list.
- The .loc indexer selects by label only and behaves much like a Python dictionary.
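The list and dictionary analogy can be checked directly. The sketch below is illustrative only: it rebuilds a three-row stand-in for the `city` Series by hand (values taken from the `city.head()` output in this section) instead of reading `college.csv`, and the names `as_list` and `as_dict` are made up for the comparison.

```python
import pandas as pd

# Hand-built stand-in for the real `city` Series (assumption: three rows only).
city = pd.Series(
    ["Normal", "Birmingham", "Montgomery"],
    index=[
        "Alabama A & M University",
        "University of Alabama at Birmingham",
        "Amridge University",
    ],
    name="CITY",
)

as_list = city.tolist()   # keeps only the values, addressed by position
as_dict = city.to_dict()  # maps each label to its value

# .iloc takes an integer position, just like indexing the list.
by_position = city.iloc[1]
print(by_position == as_list[1])

# .loc takes a label, just like looking up a key in the dict.
by_label = city.loc["Amridge University"]
print(by_label == as_dict["Amridge University"])
```

Both comparisons should print `True`, since each indexer reads the same underlying value as its plain-Python counterpart.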

- Read the college dataset with the institution name as the index, then use the indexing operator to pull out a single column as a Series.

In [2]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [3]:
city = college['CITY']

In [4]:
city.head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

- The .iloc indexer selects data by integer position only.
- Passing a single integer returns a scalar value.

In [5]:
city.iloc[3]

'Huntsville'

- Passing a list of integers to .iloc returns a Series.

In [6]:
city.iloc[[3, 4, 5]]

INSTNM
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
The University of Alabama              Tuscaloosa
Name: CITY, dtype: object

- Slice notation selects data at evenly spaced positions; here start 4, stop 50, and step 10 pick positions 4, 14, 24, 34, and 44.

In [7]:
city.iloc[4:50:10]

INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object
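To make the stepped slice concrete, the following sketch compares it with an explicit list of positions. The Series `s` and its `row0` ... `row49` labels are hypothetical stand-ins, not part of the college data.

```python
import pandas as pd

# Hypothetical 50-element Series with synthetic labels.
s = pd.Series(range(100, 150), index=[f"row{i}" for i in range(50)])

# start=4, stop=50 (exclusive), step=10
sliced = s.iloc[4:50:10]

# The same positions written out explicitly with Python's range.
listed = s.iloc[list(range(4, 50, 10))]

print(sliced.index.tolist())
print(sliced.equals(listed))
```

The slice and the explicit list select the same five rows, which is why the `.iloc` slice behaves like slicing a Python list.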

- The .loc indexer selects by label only.
- Passing a single label returns a scalar value.

In [8]:
city.loc['Heritage Christian University']

'Florence'

- As with .iloc, passing a list of labels to .loc returns a Series.

In [9]:
np.random.seed(1)

In [10]:
labels = list(np.random.choice(city.index, 4))

In [11]:
labels

['Northwest HVAC/R Training Center',
 'California State University-Dominguez Hills',
 'Lower Columbia College',
 'Southwest Acupuncture College-Boulder']

In [13]:
city.loc[labels]

INSTNM
Northwest HVAC/R Training Center                Spokane
California State University-Dominguez Hills      Carson
Lower Columbia College                         Longview
Southwest Acupuncture College-Boulder           Boulder
Name: CITY, dtype: object

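- The .loc indexer also accepts slice notation.
- Note that the start and stop values are strings, and that unlike a positional slice, a label slice includes the stop label.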

In [14]:
city.loc['Alabama State University':
         'Reid State Technical College': 10]

INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object
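One difference between the two kinds of slice is easy to miss: a `.loc` slice includes its stop label, while an `.iloc` slice excludes its stop position. The sketch below shows this on a five-row stand-in for `city`, built by hand from the `city.head()` output in this section.

```python
import pandas as pd

# Hand-built stand-in for the first five rows of `city`.
city = pd.Series(
    ["Normal", "Birmingham", "Montgomery", "Huntsville", "Montgomery"],
    index=[
        "Alabama A & M University",
        "University of Alabama at Birmingham",
        "Amridge University",
        "University of Alabama in Huntsville",
        "Alabama State University",
    ],
    name="CITY",
)

# Label slice: both endpoints are included (positions 1, 2, and 3).
by_label = city.loc[
    "University of Alabama at Birmingham":"University of Alabama in Huntsville"
]

# Positional slice: the stop position 3 is excluded (positions 1 and 2).
by_position = city.iloc[1:3]

print(len(by_label), len(by_position))
```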

- To select a single element and keep the result as a Series, pass a one-element list instead of a scalar.
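A minimal sketch of the scalar versus one-element-list distinction, again using a small hand-built stand-in for `city` instead of the full dataset:

```python
import pandas as pd

# Hand-built stand-in for the real `city` Series.
city = pd.Series(
    ["Normal", "Birmingham", "Montgomery"],
    index=[
        "Alabama A & M University",
        "University of Alabama at Birmingham",
        "Amridge University",
    ],
    name="CITY",
)

# A scalar label returns the bare value.
as_scalar = city.loc["Amridge University"]

# A one-element list returns a Series of length 1 that keeps its label.
as_series = city.loc[["Amridge University"]]

# The same rule applies to .iloc with a one-element list of positions.
as_series_by_position = city.iloc[[2]]

print(type(as_scalar).__name__, type(as_series).__name__)
```

Keeping the result as a Series preserves the index label and the Series name, which can matter when the result feeds into later pandas operations.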

## 2. Selecting DataFrame rows

- The preferred and most explicit way to select DataFrame rows is with the .iloc and .loc indexers.

- Read the college dataset and set the institution name as the index.

In [16]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [17]:
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


- Passing an integer to the .iloc indexer selects the entire row at that position.

In [18]:
college.iloc[0]

CITY                  Normal
STABBR                    AL
HBCU                       1
MENONLY                    0
WOMENONLY                  0
RELAFFIL                   0
SATVRMID                 424
SATMTMID                 420
DISTANCEONLY               0
UGDS                    4206
UGDS_WHITE            0.0333
UGDS_BLACK            0.9353
UGDS_HISP             0.0055
UGDS_ASIAN            0.0019
UGDS_AIAN             0.0024
UGDS_NHPI             0.0019
UGDS_2MOR                  0
UGDS_NRA              0.0059
UGDS_UNKN             0.0138
PPTUG_EF              0.0656
CURROPER                   1
PCTPELL               0.7356
PCTFLOAN              0.8284
UG25ABV               0.1049
MD_EARN_WNE_P10        30300
GRAD_DEBT_MDN_SUPP     33888
Name: Alabama A & M University, dtype: object

- Passing that row's index label to the .loc indexer returns the same row.

In [19]:
college.loc['Alabama A & M University']

CITY                  Normal
STABBR                    AL
HBCU                       1
MENONLY                    0
WOMENONLY                  0
RELAFFIL                   0
SATVRMID                 424
SATMTMID                 420
DISTANCEONLY               0
UGDS                    4206
UGDS_WHITE            0.0333
UGDS_BLACK            0.9353
UGDS_HISP             0.0055
UGDS_ASIAN            0.0019
UGDS_AIAN             0.0024
UGDS_NHPI             0.0019
UGDS_2MOR                  0
UGDS_NRA              0.0059
UGDS_UNKN             0.0138
PPTUG_EF              0.0656
CURROPER                   1
PCTPELL               0.7356
PCTFLOAN              0.8284
UG25ABV               0.1049
MD_EARN_WNE_P10        30300
GRAD_DEBT_MDN_SUPP     33888
Name: Alabama A & M University, dtype: object
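The two outputs above can be compared programmatically. The following sketch assumes a cut-down stand-in for `college` with three rows and three columns (values copied from the `college.head()` output), not the real file.

```python
import pandas as pd

# Hand-built stand-in for `college` (assumption: only a few rows and columns).
college = pd.DataFrame(
    {
        "CITY": ["Normal", "Birmingham", "Montgomery"],
        "STABBR": ["AL", "AL", "AL"],
        "UGDS": [4206.0, 11383.0, 291.0],
    },
    index=pd.Index(
        [
            "Alabama A & M University",
            "University of Alabama at Birmingham",
            "Amridge University",
        ],
        name="INSTNM",
    ),
)

row_by_position = college.iloc[0]
row_by_label = college.loc["Alabama A & M University"]

# Each result is a Series: the column names become its index,
# and the row label becomes its name.
print(row_by_position.name)
print(row_by_position.equals(row_by_label))
```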

- To select several non-adjacent rows, pass a list of integers to the .iloc indexer. A position may appear more than once, as 0 does here.

In [20]:
college.iloc[[0, 1, 2, 0]]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0


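- Passing a list of exact institution names to .loc selects the matching rows as a DataFrame. The labels here are chosen at random; note that In [39] calls np.random.randint again without reseeding, so it draws a fresh sample instead of reusing the positions printed by In [38].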

In [37]:
np.random.seed(10)

In [38]:
np.random.randint(len(college.index), size=4)

array([1289, 7293, 4623, 1344])

In [39]:
labels = college.index[np.random.randint(len(college.index), size=4)]

In [40]:
labels

Index(['Greenville Technical College', 'Ivy Tech Community College-Richmond',
       'Goshen College', 'Chillicothe Beauty Academy Inc'],
      dtype='object', name='INSTNM')

In [41]:
college.loc[labels]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Greenville Technical College,Greenville,SC,0.0,0.0,0.0,0,,,0.0,11186.0,...,0.0224,0.0023,0.0427,0.5681,1,0.6139,0.4705,0.413,30000,18715
Ivy Tech Community College-Richmond,Richmond,IN,,,,1,,,,,...,,,,,1,,,,29400,13000
Goshen College,Goshen,IN,0.0,0.0,0.0,1,548.0,553.0,0.0,746.0,...,0.0335,0.0952,0.0067,0.0603,1,0.308,0.599,0.1232,37800,20671
Chillicothe Beauty Academy Inc,Chillicothe,MO,0.0,0.0,0.0,0,,,0.0,15.0,...,0.0,0.0,0.0,0.0,1,0.4138,0.2069,0.0,PrivacySuppressed,PrivacySuppressed
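If the random labels should line up with a known set of positions, one option is to draw the positions once and reuse them. The sketch below is a suggestion under stated assumptions: the 20 `College N` names and the `UGDS` values are invented, and `positions` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Invented stand-in data: 20 rows with synthetic institution names.
index = pd.Index([f"College {i}" for i in range(20)], name="INSTNM")
college = pd.DataFrame({"UGDS": np.arange(20) * 100}, index=index)

np.random.seed(10)
# Draw the random positions once and keep them in a variable.
positions = np.random.randint(len(college.index), size=4)

# Translate positions into labels through the index.
labels = college.index[positions]

# The same rows can now be selected either way.
by_label = college.loc[labels]
by_position = college.iloc[positions]
print(by_label.equals(by_position))
```

Storing the draw in `positions` avoids calling `np.random.randint` a second time, so the printed positions and the selected labels stay in sync.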


- Both .iloc and .loc accept slice notation.

In [47]:
college.iloc[0:2]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5


In [49]:
start = 'Alabama A & M University'
stop = 'University of Alabama at Birmingham'
college.loc[start:stop]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
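Both slices above return the same two rows even though the positional slice stops at 2 and the label slice stops at the second row's label. A small check, assuming a three-row, two-column stand-in for `college` built by hand:

```python
import pandas as pd

# Hand-built stand-in for `college`.
college = pd.DataFrame(
    {
        "CITY": ["Normal", "Birmingham", "Montgomery"],
        "STABBR": ["AL", "AL", "AL"],
    },
    index=pd.Index(
        [
            "Alabama A & M University",
            "University of Alabama at Birmingham",
            "Amridge University",
        ],
        name="INSTNM",
    ),
)

# Positional slice: stop position 2 is excluded, so rows 0 and 1.
by_position = college.iloc[0:2]

# Label slice: the stop label is included, so the same two rows.
start = "Alabama A & M University"
stop = "University of Alabama at Birmingham"
by_label = college.loc[start:stop]

print(by_position.equals(by_label))
```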


In [50]:
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


- Select rows from the DataFrame with .iloc.
- Use the index attribute to get the index labels of those rows.
- Convert them to a list with tolist().

In [51]:
college.iloc[[0, 1, 2]].index.tolist()

['Alabama A & M University',
 'University of Alabama at Birmingham',
 'Amridge University']
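As a possible shortcut, the labels can also be taken from the index directly, without selecting the rows first. The sketch below compares the two routes on a hand-built stand-in for `college`; the variable names `via_rows` and `via_index` are illustrative.

```python
import pandas as pd

# Hand-built stand-in for `college`.
college = pd.DataFrame(
    {"CITY": ["Normal", "Birmingham", "Montgomery", "Huntsville"]},
    index=pd.Index(
        [
            "Alabama A & M University",
            "University of Alabama at Birmingham",
            "Amridge University",
            "University of Alabama in Huntsville",
        ],
        name="INSTNM",
    ),
)

# Route used in this section: select rows, then read their labels.
via_rows = college.iloc[[0, 1, 2]].index.tolist()

# Alternative: index the Index object itself with the same positions.
via_index = college.index[[0, 1, 2]].tolist()

print(via_rows == via_index)
```

Indexing the Index object skips building an intermediate DataFrame, which may matter for wide tables, though for a handful of rows either form is fine.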



## 3. Selecting DataFrame rows and columns simultaneously
## 4. Selecting data with both integers and labels
## 5. Speeding up scalar selection
## 6. Slicing rows lazily
## 7. Slicing lexicographically