# ch08. Restructuring Data into a Tidy Form

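- Several terms describe the process of restructuring data, but the one data scientists favor most is tidy data.
- Three principles determine whether data is tidy:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms its own table.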

- To apply these principles, we first need to know what a variable, an observation, and an observational unit are.

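- To build intuition for what a variable actually is, consider the distinction between a variable name and a variable value.
- Variable names are labels such as gender, race, salary, and position. Variable values are the things that change from one observation to the next, such as male/female for gender or White/Black for race.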

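- A single observation is the collection of all variable values for a single observational unit.
- To understand observational units, consider a retail store that holds data on each transaction, employee, customer, item, and the store itself.
- Each of these can be regarded as an observational unit and needs its own table.
- Merging employee information and customer information into the same table violates the principles of tidy data.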
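
As a minimal sketch of the retail example, each observational unit below gets its own table, and the tables are linked by key columns. All table names, column names, and values are hypothetical and do not come from the datasets used in this chapter.

```python
import pandas as pd

# Hypothetical data: one table per observational unit.
employees = pd.DataFrame({
    "employee_id": [1, 2],
    "name": ["Kim", "Lee"],
    "position": ["cashier", "manager"],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "name": ["Park", "Choi"],
})
# Transactions refer to the other units by key instead of repeating their attributes.
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102],
    "employee_id": [1, 1, 2],
    "customer_id": [10, 11, 10],
    "amount": [12.5, 3.0, 40.0],
})

# The units can still be combined on demand for a specific analysis.
report = transactions.merge(employees, on="employee_id").merge(
    customers, on="customer_id", suffixes=("_employee", "_customer")
)
print(report[["transaction_id", "name_employee", "name_customer", "amount"]])
```

Keeping the units separate means an employee's position is stored once, rather than once per transaction.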

- The main tools pandas provides for tidying are the DataFrame methods stack, melt, unstack, and pivot.

In [3]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

## 1. Tidying variable values as column names with stack

In [4]:
state_fruit = pd.read_csv('../data/state_fruit.csv', index_col=0)

In [5]:
state_fruit

,Apple,Orange,Banana
Texas,12,10,40
Arizona,9,7,12
Florida,0,14,190


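- This data is not tidy.

- The index of the DataFrame holds the state names.
- The state names are already arranged vertically, so they do not need to be restructured.
- The problem is the column names: Apple, Orange, and Banana are values of a fruit variable, not variable names.
- The stack() method takes all of the column names and reshapes them vertically into a single index level.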

In [6]:
state_fruit.stack()

Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64

In [7]:
state_fruit_tidy = state_fruit.stack().reset_index()

In [8]:
state_fruit_tidy

,level_0,level_1,0
0,Texas,Apple,12
1,Texas,Orange,10
2,Texas,Banana,40
3,Arizona,Apple,9
4,Arizona,Orange,7
5,Arizona,Banana,12
6,Florida,Apple,0
7,Florida,Orange,14
8,Florida,Banana,190


In [9]:
state_fruit_tidy.columns = ['state', 'fruit', 'weight']

In [10]:
state_fruit_tidy

,state,fruit,weight
0,Texas,Apple,12
1,Texas,Orange,10
2,Texas,Banana,40
3,Arizona,Apple,9
4,Arizona,Orange,7
5,Arizona,Banana,12
6,Florida,Apple,0
7,Florida,Orange,14
8,Florida,Banana,190
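
The reset and rename steps above can also be written as a single chain. The sketch below rebuilds the small table inline (so it does not need the CSV file), names the two index levels with `rename_axis`, and names the value column through the `name` argument of `Series.reset_index`. This is an alternative spelling, not the approach the notebook itself uses.

```python
import pandas as pd

# Same values as state_fruit above, built inline so the example is self-contained.
state_fruit = pd.DataFrame(
    {"Apple": [12, 9, 0], "Orange": [10, 7, 14], "Banana": [40, 12, 190]},
    index=["Texas", "Arizona", "Florida"],
)

state_fruit_tidy = (
    state_fruit.stack()                   # Series with a two-level index
    .rename_axis(["state", "fruit"])      # name the index levels
    .reset_index(name="weight")           # levels become columns; values go to "weight"
)
print(state_fruit_tidy)
```

Naming the levels before `reset_index` avoids the intermediate `level_0`, `level_1`, and `0` column names.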


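- A key point when using stack is that it does not transform the index, so every column that should not be reshaped must be placed in the index first.
- What happens in this example if the state is not set as the index?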

In [11]:
state_fruit2 = pd.read_csv('../data/state_fruit2.csv')

In [12]:
state_fruit2

,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


In [13]:
state_fruit2.stack()

0  State       Texas
   Apple          12
   Orange         10
   Banana         40
1  State     Arizona
   Apple           9
   Orange          7
   Banana         12
2  State     Florida
   Apple           0
   Orange         14
   Banana        190
dtype: object

- The State column was stacked along with the fruit columns. To reshape this data correctly, call set_index() first and then stack.

In [15]:
state_fruit2.set_index('State').stack()

State          
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64

## 2. Tidying variable values as column names with melt

In [16]:
state_fruit2 = pd.read_csv('../data/state_fruit2.csv')

In [17]:
state_fruit2

,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


- Use the melt method to perform the same transformation as before.

In [19]:
state_fruit2.melt(id_vars=['State'], value_vars=['Apple', 'Orange', 'Banana'])

,State,variable,value
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


- Two more parameters, var_name and value_name, can be used to name the two new columns.

In [20]:
state_fruit2.melt(id_vars=['State'],
                  value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit',
                  value_name='Weight')

,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


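- melt takes five parameters in the pandas version used here (later releases add ignore_index), and understanding the following two is essential to understanding how the data is reshaped.
- id_vars is a list of the column names that should be kept as columns and not reshaped.
- value_vars is a list of the column names that should be reshaped into a single column.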

- One important aspect of melt is that it ignores the values in the index.
- It silently drops the index and replaces it with a default RangeIndex.
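
The index behavior is easy to miss, so here is a minimal sketch using the same fruit values built inline. It assumes the states are stored in the index: melting directly loses them, while moving the index into a column with `reset_index` first lets `id_vars` preserve them. When `value_vars` is omitted, melt reshapes every column not listed in `id_vars`.

```python
import pandas as pd

# Fruit table with the state names stored in the index.
state_fruit = pd.DataFrame(
    {"Apple": [12, 9, 0], "Orange": [10, 7, 14], "Banana": [40, 12, 190]},
    index=pd.Index(["Texas", "Arizona", "Florida"], name="State"),
)

# melt discards the State index; the result is numbered 0..8 instead.
lost = state_fruit.melt(var_name="Fruit", value_name="Weight")

# Moving the index into a column first allows it to be kept through id_vars.
kept = state_fruit.reset_index().melt(
    id_vars="State", var_name="Fruit", value_name="Weight"
)

print(lost.head(3))
print(kept.head(3))
```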

## 3. Stacking multiple groups of variables simultaneously

- Read the movie dataset and select all of the columns that hold the actor names and their corresponding Facebook likes.

In [21]:
movie = pd.read_csv('../data/movie.csv')

In [23]:
actor = movie[['movie_title', 'actor_1_name', 'actor_2_name', 'actor_3_name',
               'actor_1_facebook_likes',
               'actor_2_facebook_likes',
               'actor_3_facebook_likes',]]

In [24]:
actor.head()

,movie_title,actor_1_name,actor_2_name,actor_3_name,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi,1000.0,936.0,855.0
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,40000.0,5000.0,1000.0
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman,11000.0,393.0,161.0
3,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,27000.0,23000.0,23000.0
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,Rob Walker,,131.0,12.0,


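- If the tidy columns are to be movie title, actor name, and number of Facebook likes, two sets of columns must be stacked separately, which cannot be done with a single call to stack or melt.
- The wide_to_long function can stack both sets at the same time. It expects each set to share a stub name followed by a numeric suffix, so the columns are renamed first.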

In [29]:
actor.columns = ['movie_title', 'actor_1', 'actor_2', 'actor_3','fb_like_1', 'fb_like_2', 'fb_like_3',]

In [31]:
actor.head()

,movie_title,actor_1,actor_2,actor_3,fb_like_1,fb_like_2,fb_like_3
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi,1000.0,936.0,855.0
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,40000.0,5000.0,1000.0
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman,11000.0,393.0,161.0
3,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,27000.0,23000.0,23000.0
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,Rob Walker,,131.0,12.0,


In [32]:
stubs = ['actor', 'fb_like']
actor2_tidy = pd.wide_to_long(actor,
                              stubnames=stubs,
                              i=['movie_title'],
                              j='actor_num',
                              sep='_')

In [33]:
actor2_tidy.head()

movie_title,actor_num,actor,fb_like
Avatar,1,CCH Pounder,1000.0
Pirates of the Caribbean: At World's End,1,Johnny Depp,40000.0
Spectre,1,Christoph Waltz,11000.0
The Dark Knight Rises,1,Tom Hardy,27000.0
Star Wars: Episode VII - The Force Awakens,1,Doug Walker,131.0
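
To see how `stubnames`, `sep`, and `j` fit together, the sketch below runs `pd.wide_to_long` on a two-movie, two-actor subset typed in by hand from the rows shown earlier (the subset itself is an assumption made to keep the example self-contained). Each column named `<stub><sep><number>` is split into its stub, which becomes an output column, and its number, which becomes the new `actor_num` index level.

```python
import pandas as pd

# Hand-typed subset of the renamed actor table.
actor = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "actor_1": ["CCH Pounder", "Christoph Waltz"],
    "actor_2": ["Joel David Moore", "Rory Kinnear"],
    "fb_like_1": [1000.0, 11000.0],
    "fb_like_2": [936.0, 393.0],
})

tidy = pd.wide_to_long(
    actor,
    stubnames=["actor", "fb_like"],  # prefixes shared by each group of columns
    i="movie_title",                 # column(s) identifying each row
    j="actor_num",                   # name for the level holding the numeric suffix
    sep="_",                         # separator between stub and suffix
)
print(tidy.sort_index())
```

Each movie now contributes one row per actor, with the actor's name and like count side by side.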


## 4. Inverting stacked data

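- DataFrames have two similar methods, stack and melt, that convert horizontal column names into vertical column values.
- These two operations can be inverted with the unstack and pivot methods, respectively.
- stack/unstack are the simpler pair and only give control over the column/row index levels, whereas melt/pivot offer more flexibility because you can choose which columns are reshaped.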

- Load the college dataset, set the institution name as the index, and read in only the race columns.

In [36]:
usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'

In [37]:
college = pd.read_csv('../data/college.csv',
                      index_col='INSTNM',
                      usecols=usecol_func)

In [38]:
college.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


- Use the stack() method to convert each horizontal column name into a vertical index level.

In [39]:
college_stacked = college.stack()

In [41]:
college_stacked.head(9)

INSTNM                              
Alabama A & M University  UGDS_WHITE    0.0333
                          UGDS_BLACK    0.9353
                          UGDS_HISP     0.0055
                          UGDS_ASIAN    0.0019
                          UGDS_AIAN     0.0024
                          UGDS_NHPI     0.0019
                          UGDS_2MOR     0.0000
                          UGDS_NRA      0.0059
                          UGDS_UNKN     0.0138
dtype: float64

- Use the Series unstack() method to return the stacked data to its original form.

In [44]:
college_stacked.unstack().head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


- A similar sequence of operations can be done with melt followed by pivot.
- First, read in the data without putting the institution name in the index.

In [45]:
college2 = pd.read_csv('../data/college.csv',
                       usecols=usecol_func)

In [46]:
college2.head()

,INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
1,University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
2,Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
3,University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
4,Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


- Use the melt() method to gather all of the race columns into a single column.

In [47]:
college_melted = college2.melt(id_vars='INSTNM',
                               var_name='Race',
                               value_name='Percentage')

In [48]:
college_melted.head()

,INSTNM,Race,Percentage
0,Alabama A & M University,UGDS_WHITE,0.0333
1,University of Alabama at Birmingham,UGDS_WHITE,0.5922
2,Amridge University,UGDS_WHITE,0.299
3,University of Alabama in Huntsville,UGDS_WHITE,0.6988
4,Alabama State University,UGDS_WHITE,0.0158


- Use the pivot method to invert the previous result.

In [49]:
melted_inv = college_melted.pivot(index='INSTNM',
                                  columns='Race',
                                  values='Percentage')

In [50]:
melted_inv.head()

Race,UGDS_2MOR,UGDS_AIAN,UGDS_ASIAN,UGDS_BLACK,UGDS_HISP,UGDS_NHPI,UGDS_NRA,UGDS_UNKN,UGDS_WHITE
INSTNM,,,,,,,,,
A & W Healthcare Educators,0.0,0.0,0.0,0.975,0.025,0.0,0.0,0.0,0.0
A T Still University of Health Sciences,,,,,,,,,
ABC Beauty Academy,0.0,0.0,0.9333,0.0333,0.0333,0.0,0.0,0.0,0.0
ABC Beauty College Inc,0.0,0.0,0.0,0.6579,0.0526,0.0,0.0,0.0,0.2895
AI Miami International University of Art and Design,0.0018,0.0,0.0018,0.0198,0.4773,0.0,0.0025,0.4644,0.0324
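
The pivoted output above is not an exact copy of the original: both the institutions and the race columns come back in sorted order, and the columns axis carries the name Race. The sketch below shows one way to restore the original layout. It uses a hypothetical two-row table rather than the college data, with rows deliberately out of alphabetical order.

```python
import pandas as pd

# Hypothetical stand-in for college2, deliberately not in alphabetical order.
wide = pd.DataFrame({
    "INSTNM": ["School B", "School A"],
    "UGDS_WHITE": [0.6, 0.3],
    "UGDS_BLACK": [0.3, 0.5],
})

melted = wide.melt(id_vars="INSTNM", var_name="Race", value_name="Percentage")
pivoted = melted.pivot(index="INSTNM", columns="Race", values="Percentage")

# Put rows and columns back in their original order, drop the "Race" label
# from the columns axis, and move INSTNM from the index back into a column.
restored = (
    pivoted.reindex(index=wide["INSTNM"].tolist(), columns=list(wide.columns[1:]))
    .rename_axis(None, axis="columns")
    .reset_index()
)
print(restored)
```

Note that pivot requires each index/columns pair to be unique, so this round trip only works when no institution name is repeated.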


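- The usecols parameter of read_csv accepts either a list of the columns to import or a function that determines them dynamically.
- The function is passed each column name as a string and must return a boolean.
- This approach can save a large amount of memory, because unwanted columns are never loaded.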
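
A minimal sketch of a callable `usecols`, using in-memory CSV text so that it runs without the college file. The CSV content and the `keep_column` helper are hypothetical. The helper uses `startswith` rather than the substring test in the lambda above, which is slightly stricter.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for college.csv.
csv_text = (
    "INSTNM,CITY,UGDS_WHITE,UGDS_BLACK\n"
    "School A,Springfield,0.3,0.5\n"
    "School B,Shelbyville,0.6,0.3\n"
)

def keep_column(name):
    # Called once per column name; return True to load that column.
    return name.startswith("UGDS_") or name == "INSTNM"

college_small = pd.read_csv(
    io.StringIO(csv_text), index_col="INSTNM", usecols=keep_column
)
print(college_small)
```

The CITY column is skipped while parsing, so it never occupies memory in the resulting DataFrame.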

## 5. Unstacking after a groupby aggregation

- Read the employee dataset and compute the mean salary by race.

In [51]:
employee = pd.read_csv('../data/employee.csv')

In [52]:
employee.groupby('RACE')['BASE_SALARY'].mean().astype(int)

RACE
American Indian or Alaskan Native    60272
Asian/Pacific Islander               61660
Black or African American            50137
Hispanic/Latino                      52345
Others                               51278
White                                64419
Name: BASE_SALARY, dtype: int32

- Now compute the mean salary by race and gender.

In [53]:
agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY']\
              .mean().astype(int)

In [54]:
agg

RACE                               GENDER
American Indian or Alaskan Native  Female    60238
                                   Male      60305
Asian/Pacific Islander             Female    63226
                                   Male      61033
Black or African American          Female    48915
                                   Male      51082
Hispanic/Latino                    Female    46503
                                   Male      54782
Others                             Female    63785
                                   Male      38771
White                              Female    66793
                                   Male      63940
Name: BASE_SALARY, dtype: int32

- This aggregation is harder to read, so it can be reshaped to make comparisons easier.
- Unstack the gender index level.

In [55]:
agg.unstack('GENDER')

GENDER,Female,Male
RACE,,
American Indian or Alaskan Native,60238,60305
Asian/Pacific Islander,63226,61033
Black or African American,48915,51082
Hispanic/Latino,46503,54782
Others,63785,38771
White,66793,63940


- Similarly, the race index level can be unstacked.

In [56]:
agg.unstack('RACE')

RACE,American Indian or Alaskan Native,Asian/Pacific Islander,Black or African American,Hispanic/Latino,Others,White
GENDER,,,,,,
Female,60238,63226,48915,46503,63785,66793
Male,60305,61033,51082,54782,38771,63940
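
The level passed to `unstack` decides which index level moves to the columns. The sketch below illustrates this on a small hypothetical salary table (not the employee dataset) and also shows that calling `unstack()` with no argument moves the innermost level, which here is GENDER.

```python
import pandas as pd

# Hypothetical salaries, not taken from employee.csv.
employee = pd.DataFrame({
    "RACE": ["White", "White", "Asian", "Asian", "Asian"],
    "GENDER": ["Female", "Male", "Female", "Male", "Male"],
    "BASE_SALARY": [60000, 50000, 70000, 40000, 60000],
})
agg = employee.groupby(["RACE", "GENDER"])["BASE_SALARY"].mean().astype(int)

by_gender = agg.unstack("GENDER")  # GENDER values become the columns
by_race = agg.unstack("RACE")      # RACE values become the columns

# With no argument, unstack moves the innermost (last) index level.
default = agg.unstack()

print(by_gender)
print(by_race)
```

The two results hold the same numbers with rows and columns swapped, so the choice depends on which comparison should read across a row.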


- When more than one aggregation is computed, the result is a DataFrame rather than a Series.

In [57]:
agg2 = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY']\
               .agg(['mean', 'max', 'min']).astype(int)

In [58]:
agg2

RACE,GENDER,mean,max,min
American Indian or Alaskan Native,Female,60238,98536,26125
American Indian or Alaskan Native,Male,60305,81239,26125
Asian/Pacific Islander,Female,63226,130416,26125
Asian/Pacific Islander,Male,61033,163228,27914
Black or African American,Female,48915,150416,24960
Black or African American,Male,51082,275000,26125
Hispanic/Latino,Female,46503,126115,26125
Hispanic/Latino,Male,54782,165216,26104
Others,Female,63785,63785,63785
Others,Male,38771,38771,38771
White,Female,66793,178331,27955
White,Male,63940,210588,26125


- Unstacking the GENDER index level returns a DataFrame with MultiIndex columns.

In [59]:
agg2.unstack('GENDER')

,mean,mean,max,max,min,min
GENDER,Female,Male,Female,Male,Female,Male
RACE,,,,,,
American Indian or Alaskan Native,60238,60305,98536,81239,26125,26125
Asian/Pacific Islander,63226,61033,130416,163228,26125,27914
Black or African American,48915,51082,150416,275000,24960,26125
Hispanic/Latino,46503,54782,126115,165216,26125,26104
Others,63785,38771,63785,38771,63785,38771
White,66793,63940,178331,210588,27955,26125
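
MultiIndex columns can be awkward at first, so the sketch below shows three common ways to select from them. It again uses a small hypothetical salary table rather than the employee dataset: selecting by the outer level, taking a cross-section on the inner GENDER level with `xs`, and addressing one cell with a tuple.

```python
import pandas as pd

# Hypothetical salaries, not taken from employee.csv.
employee = pd.DataFrame({
    "RACE": ["White", "White", "Asian", "Asian", "Asian"],
    "GENDER": ["Female", "Male", "Female", "Male", "Male"],
    "BASE_SALARY": [60000, 50000, 70000, 40000, 60000],
})
agg2 = (
    employee.groupby(["RACE", "GENDER"])["BASE_SALARY"]
    .agg(["mean", "max", "min"])
    .astype(int)
)
wide = agg2.unstack("GENDER")  # columns: (statistic, GENDER) pairs

# Outer level: all columns under "mean" (one per gender).
means = wide["mean"]

# Inner level: every statistic for one gender.
female = wide.xs("Female", axis=1, level="GENDER")

# One cell: row label plus a (statistic, gender) tuple.
asian_male_max = wide.loc["Asian", ("max", "Male")]

print(means)
print(female)
print(asian_male_max)
```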







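## 6. Replicating pivot_table with a groupby aggregation
## 7. Renaming axis levels for easy reshaping
## 8. Tidying when multiple variables are stored as column names
## 9. Tidying when multiple variables are stored as column values
## 10. Tidying when two or more variables are stored in the same cell
## 11. Tidying when variables are stored in column names and values
## 12. Tidying when multiple observational units are stored in the same table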