# ch04 Selecting Subsets of Data

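- Each dimension of a Series or DataFrame is labeled by an Index object.
- This index is what distinguishes pandas data structures from NumPy's n-dimensional arrays.
- An index provides meaningful labels for the rows and columns of the data, and pandas users can select data by those labels.
- pandas also lets you select data by the integer position of rows and columns.
- This dual capability is powerful, but it also makes the syntax for selecting subsets of data somewhat confusing.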

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

## 1. Selecting Series data

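- Series and DataFrame are complex data containers that offer several attributes which use the indexing operator to select data in different ways.
- Besides the indexing operator itself, the .iloc and .loc attributes are available.
- Together, these are called indexers.
- s[item]: indexing operator
- s.iloc[item]: .iloc indexer
- s.loc[item]: .loc indexer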

- The .iloc indexer selects only by integer position and works much like a Python list.
- The .loc indexer selects only by label and works much like a Python dictionary.
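
The list and dictionary analogy can be sketched on a tiny Series. The labels `school_a`, `school_b`, and `school_c` and the helper names `values` and `lookup` are made up for illustration; they are not part of the college dataset.

```python
import pandas as pd

# Hypothetical three-element Series; labels are invented for this sketch.
s = pd.Series(['Normal', 'Birmingham', 'Montgomery'],
              index=['school_a', 'school_b', 'school_c'])

values = s.tolist()                    # plain Python list of the values
lookup = dict(zip(s.index, values))    # plain Python dict keyed by label

by_position = s.iloc[1]        # behaves like values[1]
by_label = s.loc['school_b']   # behaves like lookup['school_b']
print(by_position, by_label)
```

Both lookups return the same element here because `school_b` happens to sit at position 1.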

- Read the college dataset with the institution name as the index, then use the indexing operator to select a single column as a Series.

In [2]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [3]:
city = college['CITY']

In [4]:
city.head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

- The .iloc indexer selects data only by integer position.
- Passing a single integer returns a scalar value.

In [5]:
city.iloc[3]

'Huntsville'

- Passing a list of integers to .iloc returns a Series.

In [6]:
city.iloc[[3, 4, 5]]

INSTNM
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
The University of Alabama              Tuscaloosa
Name: CITY, dtype: object

- Slice notation (start:stop:step) selects data at evenly spaced positions.

In [7]:
city.iloc[4:50:10]

INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object

- The .loc indexer selects only by label.
- Passing a single label returns a scalar value.

In [8]:
city.loc['Heritage Christian University']

'Florence'

- As with .iloc, passing a list of labels to .loc returns a Series.

In [9]:
np.random.seed(1)

In [10]:
labels = list(np.random.choice(city.index, 4))

In [11]:
labels

['Northwest HVAC/R Training Center',
 'California State University-Dominguez Hills',
 'Lower Columbia College',
 'Southwest Acupuncture College-Boulder']

In [12]:
city.loc[labels]

INSTNM
Northwest HVAC/R Training Center                Spokane
California State University-Dominguez Hills      Carson
Lower Columbia College                         Longview
Southwest Acupuncture College-Boulder           Boulder
Name: CITY, dtype: object

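- The .loc indexer also accepts slice notation.
- Note that the start and stop values are strings (labels) and that, unlike .iloc, the stop label is included in the result.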

In [13]:
city.loc['Alabama State University':
         'Reid State Technical College': 10]

INSTNM
Alabama State University              Montgomery
Enterprise State Community College    Enterprise
Heritage Christian University           Florence
Marion Military Institute                 Marion
Reid State Technical College           Evergreen
Name: CITY, dtype: object

- To select a single element and keep the result as a Series, pass a one-element list instead of a scalar.
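
A minimal sketch of the difference between a scalar and a one-element list, using a made-up Series (the `school_*` labels are hypothetical):

```python
import pandas as pd

# Hypothetical Series standing in for the CITY column.
s = pd.Series(['Normal', 'Birmingham', 'Montgomery'],
              index=['school_a', 'school_b', 'school_c'])

scalar = s.loc['school_b']         # scalar label -> plain Python string
one_row = s.loc[['school_b']]      # one-element list -> Series of length 1
same_by_position = s.iloc[[1]]     # the same idea with an integer position

print(type(scalar).__name__, type(one_row).__name__)
```

Keeping the result as a Series preserves the index label, which is useful when the next step expects a Series rather than a bare value.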

## 2. Selecting DataFrame rows

- The most explicit and preferred way to select DataFrame rows is with the .iloc and .loc indexers.

- Read the college dataset and set the institution name as the index.

In [14]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [15]:
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


- Passing an integer to the .iloc indexer selects the entire row at that position.

In [16]:
college.iloc[0]

CITY                  Normal
STABBR                    AL
HBCU                       1
MENONLY                    0
WOMENONLY                  0
RELAFFIL                   0
SATVRMID                 424
SATMTMID                 420
DISTANCEONLY               0
UGDS                    4206
UGDS_WHITE            0.0333
UGDS_BLACK            0.9353
UGDS_HISP             0.0055
UGDS_ASIAN            0.0019
UGDS_AIAN             0.0024
UGDS_NHPI             0.0019
UGDS_2MOR                  0
UGDS_NRA              0.0059
UGDS_UNKN             0.0138
PPTUG_EF              0.0656
CURROPER                   1
PCTPELL               0.7356
PCTFLOAN              0.8284
UG25ABV               0.1049
MD_EARN_WNE_P10        30300
GRAD_DEBT_MDN_SUPP     33888
Name: Alabama A & M University, dtype: object

- Passing the index label to the .loc indexer returns the same row.

In [17]:
college.loc['Alabama A & M University']

CITY                  Normal
STABBR                    AL
HBCU                       1
MENONLY                    0
WOMENONLY                  0
RELAFFIL                   0
SATVRMID                 424
SATMTMID                 420
DISTANCEONLY               0
UGDS                    4206
UGDS_WHITE            0.0333
UGDS_BLACK            0.9353
UGDS_HISP             0.0055
UGDS_ASIAN            0.0019
UGDS_AIAN             0.0024
UGDS_NHPI             0.0019
UGDS_2MOR                  0
UGDS_NRA              0.0059
UGDS_UNKN             0.0138
PPTUG_EF              0.0656
CURROPER                   1
PCTPELL               0.7356
PCTFLOAN              0.8284
UG25ABV               0.1049
MD_EARN_WNE_P10        30300
GRAD_DEBT_MDN_SUPP     33888
Name: Alabama A & M University, dtype: object

- To select several non-adjacent rows, pass a list of integers to the .iloc indexer.

In [18]:
college.iloc[[0, 1, 2, 0]]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0


- Passing .loc a list of exact institution names likewise returns a DataFrame containing those rows.

In [19]:
np.random.seed(10)

In [20]:
np.random.randint(len(college.index), size=4)

array([1289, 7293, 4623, 1344])

In [21]:
labels = college.index[np.random.randint(len(college.index), size=4)]

In [22]:
labels

Index(['Greenville Technical College', 'Ivy Tech Community College-Richmond',
       'Goshen College', 'Chillicothe Beauty Academy Inc'],
      dtype='object', name='INSTNM')

In [23]:
college.loc[labels]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Greenville Technical College,Greenville,SC,0.0,0.0,0.0,0,,,0.0,11186.0,...,0.0224,0.0023,0.0427,0.5681,1,0.6139,0.4705,0.413,30000,18715
Ivy Tech Community College-Richmond,Richmond,IN,,,,1,,,,,...,,,,,1,,,,29400,13000
Goshen College,Goshen,IN,0.0,0.0,0.0,1,548.0,553.0,0.0,746.0,...,0.0335,0.0952,0.0067,0.0603,1,0.308,0.599,0.1232,37800,20671
Chillicothe Beauty Academy Inc,Chillicothe,MO,0.0,0.0,0.0,0,,,0.0,15.0,...,0.0,0.0,0.0,0.0,1,0.4138,0.2069,0.0,PrivacySuppressed,PrivacySuppressed


- Both .iloc and .loc accept slice notation.

In [24]:
college.iloc[0:2]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5


In [25]:
start = 'Alabama A & M University'
stop = 'University of Alabama at Birmingham'
college.loc[start:stop]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
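
The two slices above return the same rows even though their stop values differ, because an .iloc slice excludes its stop position while a .loc slice includes its stop label. A small sketch on a hand-built frame (three rows copied from the output above, not the full dataset):

```python
import pandas as pd

# Hand-built stand-in for the first three rows of the college data.
df = pd.DataFrame(
    {'CITY': ['Normal', 'Birmingham', 'Montgomery'],
     'STABBR': ['AL', 'AL', 'AL']},
    index=['Alabama A & M University',
           'University of Alabama at Birmingham',
           'Amridge University'])

# Positions 0 and 1 are returned; position 2 is excluded.
by_position = df.iloc[0:2]

# Both the start label and the stop label are returned.
by_label = df.loc['Alabama A & M University':
                  'University of Alabama at Birmingham']

print(by_position.equals(by_label))
```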


In [26]:
college.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


- Select rows from the DataFrame with .iloc.
- Use the index attribute to get the index labels of those rows.
- Convert the labels to a list with tolist().

In [27]:
college.iloc[[0, 1, 2]].index.tolist()

['Alabama A & M University',
 'University of Alabama at Birmingham',
 'Amridge University']

## 3. Selecting DataFrame rows and columns simultaneously

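- To select rows and columns at the same time, pass both the row selection and the column selection to the .iloc or .loc indexer, separated by a comma (rows first, then columns).
- Each selection can be a scalar, a list, a slice object, or a boolean array.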
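
Of the four selection types listed above, the boolean array is the least obvious, so here is a minimal sketch. The frame is hand-built from three rows of the college data, and the mask names `row_mask` and `col_mask` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for three rows and three columns of the college data.
df = pd.DataFrame(
    {'CITY': ['Normal', 'Birmingham', 'Montgomery'],
     'STABBR': ['AL', 'AL', 'AL'],
     'UGDS': [4206.0, 11383.0, 291.0]},
    index=['Alabama A & M University',
           'University of Alabama at Birmingham',
           'Amridge University'])

row_mask = np.array([True, False, True])    # keep the first and third rows
col_mask = np.array([False, True, True])    # keep STABBR and UGDS

# Boolean arrays on both axes with .iloc.
subset_iloc = df.iloc[row_mask, col_mask]

# A boolean array for rows combined with a list of labels for columns.
subset_loc = df.loc[row_mask, ['STABBR', 'UGDS']]

# A scalar on both axes returns a single value.
one_value = df.loc['Alabama A & M University', 'UGDS']
print(subset_iloc.equals(subset_loc), one_value)
```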

- Read the college dataset and use the institution name as the index.
- Select the first three rows and the first four columns.

In [28]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [29]:
college.iloc[:3, :4]

INSTNM,CITY,STABBR,HBCU,MENONLY
Alabama A & M University,Normal,AL,1.0,0.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0
Amridge University,Montgomery,AL,0.0,0.0


In [30]:
college.loc[:'Amridge University', :'MENONLY']

INSTNM,CITY,STABBR,HBCU,MENONLY
Alabama A & M University,Normal,AL,1.0,0.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0
Amridge University,Montgomery,AL,0.0,0.0


- Select all rows of two non-adjacent columns.

In [31]:
college.iloc[:, [4, 6]].head()

INSTNM,WOMENONLY,SATVRMID
Alabama A & M University,0.0,424.0
University of Alabama at Birmingham,0.0,570.0
Amridge University,0.0,
University of Alabama in Huntsville,0.0,595.0
Alabama State University,0.0,425.0


In [33]:
college.loc[:, ['WOMENONLY', 'SATVRMID']].head()

INSTNM,WOMENONLY,SATVRMID
Alabama A & M University,0.0,424.0
University of Alabama at Birmingham,0.0,570.0
Amridge University,0.0,
University of Alabama in Huntsville,0.0,595.0
Alabama State University,0.0,425.0


- Select a single scalar value.

In [35]:
college.iloc[4, -4]

0.7554

In [36]:
college.loc['Alabama State University', 'PCTFLOAN']

0.7554

- Slice the rows and select a single column.

In [37]:
college.iloc[90:80:-2, 5]

INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64

In [38]:
start = 'Empire Beauty School-Flagstaff'
stop = 'Arizona State University-Tempe'
college.loc[start:stop:-2, 'RELAFFIL']

INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64
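
The negative-step slices above can be reproduced on a toy Series. The single-letter labels below are invented for this sketch; the point is only that the .iloc stop position is excluded while the .loc stop label is included.

```python
import pandas as pd

# Hypothetical Series with six letter labels.
s = pd.Series(range(6), index=list('abcdef'))

# Start at position 4 and step backwards by 2; position 0 is excluded.
by_position = s.iloc[4:0:-2]

# Start at label 'e' and step backwards by 2; the stop label 'c' is included.
by_label = s.loc['e':'c':-2]

print(by_position.index.tolist(), by_label.index.tolist())
```

To get matching results, the .loc stop label has to be the last label you actually want, whereas the .iloc stop position is one step beyond it.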

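- When selecting a subset of rows with all columns, the comma and the colon after it are optional.
- Without a comma, the default behavior is to select all columns.
- You can also write the column selection explicitly as a full slice (:); the two forms are equivalent.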

In [40]:
college.iloc[:10].head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [41]:
college.iloc[:10, :].head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5
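
Rather than comparing the two outputs by eye, one way to confirm that the forms are equivalent is DataFrame.equals. This is a sketch on a hand-built frame, not on the full dataset.

```python
import pandas as pd

# Hand-built stand-in for three rows of the college data.
df = pd.DataFrame(
    {'CITY': ['Normal', 'Birmingham', 'Montgomery'],
     'STABBR': ['AL', 'AL', 'AL']},
    index=['Alabama A & M University',
           'University of Alabama at Birmingham',
           'Amridge University'])

rows_only = df.iloc[:2]          # no comma: all columns are selected
rows_and_cols = df.iloc[:2, :]   # explicit full slice for the columns

# The same holds for .loc.
loc_rows_only = df.loc[:'University of Alabama at Birmingham']
loc_rows_and_cols = df.loc[:'University of Alabama at Birmingham', :]

print(rows_only.equals(rows_and_cols), loc_rows_only.equals(loc_rows_and_cols))
```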


## 4. Selecting data with both integers and labels

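- Early versions of pandas provided the .ix indexer, which accepted integers and labels at the same time.
- Do not use .ix: it was deprecated in pandas 0.20 and removed in pandas 1.0.
- Instead, find the integer position of the labels you need (here, the columns) and then select with .iloc.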

In [42]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [44]:
col_start = college.columns.get_loc('UGDS_WHITE')

In [45]:
col_end = college.columns.get_loc('UGDS_UNKN') + 1

In [46]:
col_start, col_end

(10, 19)

In [47]:
college.iloc[:5, col_start:col_end]

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


- The columns attribute returns the column index.
- An index has a get_loc() method, which takes an index label and returns its integer position.
- Because an .iloc slice excludes its stop position, add 1 to the stop value when slicing.
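
The get_loc() lookup and the +1 adjustment can be sketched on a small frame. The five columns and two rows below are copied by hand from the output in this chapter; the variable names mirror the cells above.

```python
import pandas as pd

# Hand-built stand-in with five columns of the college data.
df = pd.DataFrame(
    [[4206.0, 0.0333, 0.9353, 0.0055, 0.7356],
     [11383.0, 0.5922, 0.2600, 0.0283, 0.3460]],
    index=['Alabama A & M University',
           'University of Alabama at Birmingham'],
    columns=['UGDS', 'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'PCTPELL'])

col_start = df.columns.get_loc('UGDS_WHITE')     # position of the first column wanted
col_end = df.columns.get_loc('UGDS_HISP') + 1    # +1 because .iloc excludes the stop

by_position = df.iloc[:1, col_start:col_end]

# The label-based equivalent needs no adjustment: .loc includes the stop label.
by_label = df.loc[:'Alabama A & M University', 'UGDS_WHITE':'UGDS_HISP']

print(col_start, col_end, by_position.equals(by_label))
```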

- Going the other way, you can look up labels from integer positions and let .loc perform a similar selection.

In [48]:
row_start = college.index[10]
row_end = college.index[15]

In [49]:
college.loc[row_start:row_end, 'UGDS_WHITE':'UGDS_UNKN']

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Birmingham Southern College,0.7983,0.1102,0.0195,0.0517,0.0102,0.0,0.0051,0.0,0.0051
Chattahoochee Valley Community College,0.4661,0.4372,0.0492,0.0127,0.0023,0.0035,0.0151,0.0,0.0139
Concordia College Alabama,0.028,0.8758,0.0373,0.0093,0.0,0.0,0.0031,0.0466,0.0
South University-Montgomery,0.3046,0.6054,0.0153,0.0153,0.0153,0.0096,0.0,0.0019,0.0326
Enterprise State Community College,0.6408,0.2435,0.0509,0.0202,0.0081,0.0029,0.0254,0.0012,0.0069
James H Faulkner State Community College,0.6979,0.2259,0.032,0.0084,0.0177,0.0014,0.0152,0.0007,0.0009


- Chaining .iloc and .loc produces the same result, but chaining indexers is generally discouraged.
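
A sketch of what chaining looks like next to a single-indexer alternative, on a hand-built frame. Both expressions select the same data; the chained form simply runs two indexing operations and creates an intermediate DataFrame along the way.

```python
import pandas as pd

# Hand-built stand-in with three rows and four columns of the college data.
df = pd.DataFrame(
    [[4206.0, 0.0333, 0.9353, 0.0055],
     [11383.0, 0.5922, 0.2600, 0.0283],
     [291.0, 0.2990, 0.4192, 0.0069]],
    index=['Alabama A & M University',
           'University of Alabama at Birmingham',
           'Amridge University'],
    columns=['UGDS', 'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP'])

# Chained: rows by position first, then columns by label.
chained = df.iloc[:2].loc[:, 'UGDS_WHITE':'UGDS_HISP']

# Single indexer: convert the positions to labels, then make one .loc call.
single = df.loc[df.index[:2], 'UGDS_WHITE':'UGDS_HISP']

print(chained.equals(single))
```

For plain selection the results match. The usual reason to avoid chaining is assignment: writing through a chained expression may modify a temporary copy instead of the original DataFrame.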

## 5. Speeding up scalar selection

- Both the .iloc and .loc indexers can select a scalar value from a Series or DataFrame.
- The .iat and .at indexers return the same result faster.

In [50]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

In [51]:
cn = 'Concordia College Alabama'

In [52]:
college.loc[cn, 'UGDS_WHITE']

0.028

In [53]:
college.at[cn, 'UGDS_WHITE']

0.028

In [54]:
%timeit college.loc[cn, 'UGDS_WHITE']

8.2 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [55]:
%timeit college.at[cn, 'UGDS_WHITE']

5.4 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [56]:
row_num = college.index.get_loc(cn)

In [57]:
col_num = college.columns.get_loc('UGDS_WHITE')

In [58]:
row_num, col_num

(12, 10)

In [60]:
%timeit college.iloc[row_num, col_num]

9.34 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [61]:
%timeit college.iat[row_num, col_num]

6.08 µs ± 8.15 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


- .iat and .at are called scalar indexers.
- Both also work on a Series.

In [63]:
state = college['STABBR']

In [64]:
state.iat[1000]

'IL'

In [65]:
state.at['Stanford University']

'CA'

- When prefixed with two percent signs (%%timeit), the timeit magic command times the entire code cell.
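
The %timeit and %%timeit magics only exist inside IPython and Jupyter. Outside a notebook, a rough equivalent is the standard-library timeit module. The sketch below uses a hand-built two-row frame, so the timings will not match the numbers shown above and will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hand-built stand-in; only these two rows and two columns exist here.
df = pd.DataFrame(
    {'UGDS_WHITE': [0.0333, 0.0280], 'UGDS_BLACK': [0.9353, 0.8758]},
    index=['Alabama A & M University', 'Concordia College Alabama'])

cn = 'Concordia College Alabama'
row_num = df.index.get_loc(cn)                # integer position of the row label
col_num = df.columns.get_loc('UGDS_WHITE')    # integer position of the column label

# All four indexers return the same scalar.
values = [df.loc[cn, 'UGDS_WHITE'], df.at[cn, 'UGDS_WHITE'],
          df.iloc[row_num, col_num], df.iat[row_num, col_num]]

# Total seconds for 1,000 lookups with each label-based indexer.
loc_seconds = timeit.timeit(lambda: df.loc[cn, 'UGDS_WHITE'], number=1000)
at_seconds = timeit.timeit(lambda: df.at[cn, 'UGDS_WHITE'], number=1000)

print(values, loc_seconds, at_seconds)
```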

## 6. Slicing rows lazily

- There is a shortcut for selecting rows that uses only the indexing operator.
- However, the primary purpose of the indexing operator is to select DataFrame columns.
- To select rows without ambiguity, it is best to use .iloc or .loc.

In [66]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')

- Pass a slice object to the indexing operator.

In [67]:
college[10:20:2]

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Birmingham Southern College,Birmingham,AL,0.0,0.0,0.0,1,560.0,560.0,0.0,1180.0,...,0.0051,0.0,0.0051,0.0017,1,0.192,0.4809,0.0152,44200.0,27000
Concordia College Alabama,Selma,AL,1.0,0.0,0.0,1,420.0,400.0,0.0,322.0,...,0.0031,0.0466,0.0,0.1056,1,0.8667,0.9333,0.2367,19900.0,PrivacySuppressed
Enterprise State Community College,Enterprise,AL,0.0,0.0,0.0,0,,,0.0,1729.0,...,0.0254,0.0012,0.0069,0.3823,1,0.4895,0.2263,0.3399,24600.0,8273
Faulkner University,Montgomery,AL,0.0,0.0,0.0,1,,,0.0,2367.0,...,0.0173,0.0182,0.0258,0.2302,1,0.5812,0.7253,0.4589,37200.0,22000
New Beginning College of Cosmetology,Albertville,AL,0.0,0.0,0.0,0,,,0.0,115.0,...,0.0,0.0,0.0,0.0783,1,0.8224,0.8553,0.3933,,5500
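
The ambiguity of the indexing operator can be seen on a small hand-built frame: a slice selects rows, but a single label is looked up among the columns. The variable names below are made up for this sketch.

```python
import pandas as pd

# Hand-built stand-in for four rows of the college data.
df = pd.DataFrame(
    {'CITY': ['Normal', 'Birmingham', 'Montgomery', 'Huntsville'],
     'STABBR': ['AL', 'AL', 'AL', 'AL']},
    index=['Alabama A & M University',
           'University of Alabama at Birmingham',
           'Amridge University',
           'University of Alabama in Huntsville'])

rows_by_position = df[1:3]     # integer slice -> rows, same as df.iloc[1:3]
rows_by_label = df['University of Alabama at Birmingham':
                   'Amridge University']   # label slice -> rows, stop included
one_column = df['CITY']        # single label -> a column, not a row

# A single row label is searched for among the columns and is not found.
try:
    df['Amridge University']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

print(rows_by_position.equals(df.iloc[1:3]), row_lookup_failed)
```

Because the same brackets mean "rows" for slices and "columns" for everything else, .iloc and .loc remain the clearer choice for row selection.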








## 7. Slicing lexicographically