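# Dummy variables
- A way to encode a `nominal variable` as numeric 0/1 indicator columns
- In reports, dummy variables are usually written as `control factors` or `control variables`
- Requires the `patsy` package
- The sample data keeps its Korean names because the code refers to them: `성별` (sex: `남자` male, `여자` female) and `비만도` (obesity level: `정상` normal, `경도` mild, `고도` severe)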

## 1. Preparation
### Importing packages and loading the data

In [6]:
from pandas import read_excel, DataFrame
from patsy import dmatrix
import numpy as np

In [7]:
df = read_excel('https://data.hossam.kr/C02/dum.xlsx')
df

|  | 성별 | 비만도 |
|---|---|---|
| 0 | 남자 | 정상 |
| 1 | 여자 | 경도 |
| 2 | 여자 | 정상 |
| 3 | 남자 | 고도 |
| 4 | 남자 | 정상 |
| 5 | 남자 | 경도 |
| 6 | 남자 | 고도 |
| 7 | 여자 | 고도 |
| 8 | 여자 | 경도 |
| 9 | 남자 | 고도 |


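### Dummy-coding nominal variables
- `dmatrix('column_name + 0', dataframe)`
- Append `+ 0` to the formula to drop the intercept; patsy then emits one 0/1 column per level of the first categorical term. Without it, the output has an `Intercept` column and the first level is dropped.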
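
To see what `+ 0` changes, the following minimal sketch compares the column names patsy generates with and without it. The small `demo` frame is a made-up stand-in for the `성별` column, not the workbook loaded above.

```python
from pandas import DataFrame
from patsy import dmatrix

# Hypothetical stand-in data (not the original workbook)
demo = DataFrame({'성별': ['남자', '여자', '여자', '남자']})

# Default formula: an intercept is included, so the first level is dropped
with_intercept = dmatrix('성별', demo)
print(with_intercept.design_info.column_names)

# '+ 0' removes the intercept, so every level gets its own 0/1 column
without_intercept = dmatrix('성별 + 0', demo)
print(without_intercept.design_info.column_names)
```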

In [11]:
dmatrix('성별 + 0', df)

DesignMatrix with shape (20, 2)
  성별[남자]  성별[여자]
       1       0
       0       1
       0       1
       1       0
       1       0
       1       0
       1       0
       0       1
       0       1
       1       0
       0       1
       0       1
       0       1
       1       0
       0       1
       0       1
       1       0
       0       1
       0       1
       0       1
  Terms:
    '성별' (columns 0:2)

In [12]:
dmatrix('비만도 + 0', df)

DesignMatrix with shape (20, 3)
  비만도[경도]  비만도[고도]  비만도[정상]
        0        0        1
        1        0        0
        0        0        1
        0        1        0
        0        0        1
        1        0        0
        0        1        0
        0        1        0
        1        0        0
        0        1        0
        0        0        1
        0        0        1
        0        0        1
        0        0        1
        0        1        0
        1        0        0
        0        1        0
        1        0        0
        0        0        1
        1        0        0
  Terms:
    '비만도' (columns 0:3)

In [32]:
dp = dmatrix('성별 + 비만도 + 0', df)
dp

DesignMatrix with shape (20, 4)
  성별[남자]  성별[여자]  비만도[T.고도]  비만도[T.정상]
       1       0          0          1
       0       1          0          0
       0       1          0          1
       1       0          1          0
       1       0          0          1
       1       0          0          0
       1       0          1          0
       0       1          1          0
       0       1          0          0
       1       0          1          0
       0       1          0          1
       0       1          0          1
       0       1          0          1
       1       0          0          1
       0       1          1          0
       0       1          0          0
       1       0          1          0
       0       1          0          0
       0       1          0          1
       0       1          0          0
  Terms:
    '성별' (columns 0:2)
    '비만도' (columns 2:4)
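
The `T.` prefix on the `비만도` columns marks treatment (reduced-rank) coding: even with `+ 0`, only the first categorical term keeps all of its levels. The indicator columns of each categorical variable sum to 1 in every row, so a second full set would be linearly dependent on the first. A minimal sketch of that check, using a made-up `demo` frame rather than the original workbook:

```python
import numpy as np
from pandas import DataFrame
from patsy import dmatrix

# Hypothetical stand-in data covering every combination of levels
demo = DataFrame({
    '성별': ['남자', '여자', '여자', '남자', '남자', '여자'],
    '비만도': ['정상', '경도', '정상', '고도', '경도', '고도'],
})

# Full sets of dummies for each variable, placed side by side: 2 + 3 = 5 columns
full = np.hstack([
    np.asarray(dmatrix('성별 + 0', demo)),
    np.asarray(dmatrix('비만도 + 0', demo)),
])
full_rank = np.linalg.matrix_rank(full)
print(full.shape[1], full_rank)  # more columns than the rank: one is redundant

# The combined formula drops one level of the second term
combined = dmatrix('성별 + 비만도 + 0', demo)
combined_rank = np.linalg.matrix_rank(np.asarray(combined))
print(combined.shape[1], combined_rank)
```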

In [33]:
dm = dmatrix('성별:비만도 + 0', df)
dm

DesignMatrix with shape (20, 6)
  Columns:
    ['성별[남자]:비만도[경도]',
     '성별[여자]:비만도[경도]',
     '성별[남자]:비만도[고도]',
     '성별[여자]:비만도[고도]',
     '성별[남자]:비만도[정상]',
     '성별[여자]:비만도[정상]']
  Terms:
    '성별:비만도' (columns 0:6)
  (to view full data, use np.asarray(this_obj))

In [34]:
dm.design_info.column_names

['성별[남자]:비만도[경도]',
 '성별[여자]:비만도[경도]',
 '성별[남자]:비만도[고도]',
 '성별[여자]:비만도[고도]',
 '성별[남자]:비만도[정상]',
 '성별[여자]:비만도[정상]']

In [35]:
dmarray = np.asarray(dm)
dmarray

array([[0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0.]])

In [36]:
dummy_df = DataFrame(dmarray, columns = dm.design_info.column_names)
dummy_df

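|  | 성별[남자]:비만도[경도] | 성별[여자]:비만도[경도] | 성별[남자]:비만도[고도] | 성별[여자]:비만도[고도] | 성별[남자]:비만도[정상] | 성별[여자]:비만도[정상] |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |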
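
As a possible shortcut for the `np.asarray` plus `DataFrame` steps above, `dmatrix` accepts `return_type='dataframe'` and hands back a pandas DataFrame with the column names already attached. A minimal sketch on a made-up `demo` frame (not the original workbook):

```python
from pandas import DataFrame
from patsy import dmatrix

# Hypothetical stand-in data covering every combination of levels
demo = DataFrame({
    '성별': ['남자', '여자', '여자', '남자', '남자', '여자'],
    '비만도': ['정상', '경도', '정상', '고도', '경도', '고도'],
})

# Ask patsy for a DataFrame directly instead of a DesignMatrix
dummy_demo = dmatrix('성별:비만도 + 0', demo, return_type='dataframe')
print(list(dummy_demo.columns))

# Each row belongs to exactly one sex/obesity combination
row_sums = dummy_demo.sum(axis=1).tolist()
print(row_sums)
```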


### When the source data is label-encoded

In [43]:
df2 = df.copy()
df2

|  | 성별 | 비만도 |
|---|---|---|
| 0 | 남자 | 정상 |
| 1 | 여자 | 경도 |
| 2 | 여자 | 정상 |
| 3 | 남자 | 고도 |
| 4 | 남자 | 정상 |
| 5 | 남자 | 경도 |
| 6 | 남자 | 고도 |
| 7 | 여자 | 고도 |
| 8 | 여자 | 경도 |
| 9 | 남자 | 고도 |


#### Replacing a value in every column
- `dataframe.replace('old_value', n, inplace=True)`
    - n = the replacement value
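
`replace` also accepts a dict of old-to-new values, which handles several substitutions in one call and returns a new frame when `inplace` is left at its default. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
from pandas import DataFrame

# Hypothetical stand-in data (not the original workbook)
demo = DataFrame({
    '성별': ['남자', '여자', '여자'],
    '비만도': ['정상', '경도', '고도'],
})

# One call replaces both values wherever they appear, in any column
encoded = demo.replace({'남자': 0, '여자': 1})
print(encoded)
```

Because `inplace` is not set here, `demo` itself is left unchanged.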

In [44]:
df2.replace("남자", 0, inplace=True)
df2.replace("여자", 1, inplace=True)
df2

|  | 성별 | 비만도 |
|---|---|---|
| 0 | 0 | 정상 |
| 1 | 1 | 경도 |
| 2 | 1 | 정상 |
| 3 | 0 | 고도 |
| 4 | 0 | 정상 |
| 5 | 0 | 경도 |
| 6 | 0 | 고도 |
| 7 | 1 | 고도 |
| 8 | 1 | 경도 |
| 9 | 0 | 고도 |


#### Replacing a value in one column only
- `dataframe.replace({'column_name': 'old_value'}, n, inplace=True)`
    - n = the replacement value
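
For a single column, one alternative is `Series.map` with a dict, which encodes all levels in one pass. This is a swapped-in technique rather than the `replace` call described above, shown on a made-up `demo` frame; `level_codes` is a name chosen for illustration.

```python
from pandas import DataFrame

# Hypothetical stand-in data (not the original workbook)
demo = DataFrame({
    '성별': [0, 1, 1, 0],
    '비만도': ['정상', '경도', '고도', '정상'],
})

# Map each obesity level to an integer code.
# Note: any value missing from the dict would become NaN.
level_codes = {'정상': 0, '경도': 1, '고도': 2}
demo['비만도'] = demo['비만도'].map(level_codes)
print(demo)
```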

In [49]:
df2.replace({"비만도": "정상"}, 0, inplace=True)
df2.replace({"비만도": "경도"}, 1, inplace=True)
df2.replace({"비만도": "고도"}, 2, inplace=True)
df2

|  | 성별 | 비만도 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 1 | 0 |
| 3 | 0 | 2 |
| 4 | 0 | 0 |
| 5 | 0 | 1 |
| 6 | 0 | 2 |
| 7 | 1 | 2 |
| 8 | 1 | 1 |
| 9 | 0 | 2 |


In [51]:
df2.dtypes

성별     int64
비만도    int64
dtype: object

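### Dummy-coding label-encoded data
- Wrap each column in `C(...)` in the formula to mark it as categorical; otherwise patsy treats an integer column as a plain numeric variable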
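
The effect of `C(...)` can be seen by comparing the generated column names with and without it. A minimal sketch, assuming a made-up integer-coded `demo` frame rather than `df2`:

```python
from pandas import DataFrame
from patsy import dmatrix

# Hypothetical stand-in data: obesity level already encoded as 0/1/2
demo = DataFrame({'비만도': [0, 1, 2, 0, 2]})

# Without C(): the integers are used as a single numeric column
numeric = dmatrix('비만도 + 0', demo)
print(numeric.design_info.column_names)

# With C(): each distinct integer becomes its own 0/1 column
categorical = dmatrix('C(비만도) + 0', demo)
print(categorical.design_info.column_names)
```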

In [52]:
dm = dmatrix('C(성별): C(비만도) + 0', df2)
dummy_df = DataFrame(np.asarray(dm), columns=dm.design_info.column_names)
dummy_df

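|  | C(성별)[0]:C(비만도)[0] | C(성별)[1]:C(비만도)[0] | C(성별)[0]:C(비만도)[1] | C(성별)[1]:C(비만도)[1] | C(성별)[0]:C(비만도)[2] | C(성별)[1]:C(비만도)[2] |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |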
