## Types of Machine Learning
 - Supervised learning: learns the relationship between X and Y so that Y can be predicted from X
   - Regression: predicts a continuous value (e.g., price)
   - Classification: predicts a discrete value (e.g., sex)
 - Unsupervised learning: derives new variables that capture patterns in the data
   - Clustering: partitions the data into groups (clusters) of similar observations
   - Dimensionality reduction: represents the data with a smaller number of variables
 - Reinforcement learning: searches for an optimal policy in a setting with rewards and penalties
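
As a rough illustration of the supervised and unsupervised categories above, the following sketch fits one small model per task using only base R and its built-in `mtcars` and `iris` datasets. The datasets and model choices are assumptions made for illustration, not part of the original material. Reinforcement learning is left out because it needs an interactive environment.

```r
# Built-in datasets only; each model is a minimal stand-in for one task type.

# Supervised, regression: predict a continuous value (fuel economy) from weight.
reg_fit <- lm(mpg ~ wt, data = mtcars)

# Supervised, classification: predict a binary label (transmission type).
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Unsupervised, clustering: group the iris measurements into 3 clusters.
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)

# Unsupervised, dimensionality reduction: summarize 4 variables with 2 components.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
reduced <- pca$x[, 1:2]
```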

# Data Preprocessing

### caret
 - Classification And REgression Training
 - An R package for classification and regression

In [3]:
#install.packages('caret')

In [4]:
library(caret)

Loading required package: lattice
Loading required package: ggplot2


In [6]:
car = read.csv("../0909/automobile.csv")

In [7]:
head(car)

symboling,normalized_losses,maker,fuel,aspiration,doors,body,wheels,engine_location,wheel_base,⋯,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
2,164,audi,gas,std,four,sedan,fwd,front,99.8,⋯,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.4,⋯,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
1,158,audi,gas,std,four,sedan,fwd,front,105.8,⋯,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,⋯,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
2,192,bmw,gas,std,two,sedan,rwd,front,101.2,⋯,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430
0,192,bmw,gas,std,four,sedan,rwd,front,101.2,⋯,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16925


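## Dummy Variables
Each distinct value of a discrete variable is called a level. For example, a blood type variable has four levels: A, B, AB, and O. Levels are not numeric, so most models cannot use them directly. Dummy variables address this by creating a separate variable for each level of the discrete variable. Each dummy variable takes the value 1 or 0.
 - Also known as one-hot encoding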

In [8]:
dummies = dummyVars(city_mpg ~ wheels, car)

In [10]:
head(predict(dummies, newdata = car))

row,wheels.4wd,wheels.fwd,wheels.rwd
1,0,1,0
2,1,0,0
3,0,1,0
4,0,1,0
5,0,0,1
6,0,0,1
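
The same kind of encoding can be sketched in base R with `model.matrix`, which may help show what a one-hot matrix looks like without caret. The toy data frame below is hypothetical; only its level names mirror the `wheels` column.

```r
# Hypothetical toy data; the level names mirror the `wheels` column above.
toy <- data.frame(wheels = factor(c("fwd", "4wd", "rwd", "fwd")))

# `- 1` drops the intercept so that every level gets its own 0/1 column.
onehot <- model.matrix(~ wheels - 1, data = toy)
onehot
```

Each row contains exactly one 1. With an intercept in the formula, `model.matrix` instead drops one level as the reference category, which is the usual choice for linear models because the full set of dummy columns is collinear with the intercept.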


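## Zero-Variance Predictors
If a predictor (independent variable) has zero variance, that is, every value is the same, most predictive models either cannot be fit or produce unreliable predictions. The same problem can occur when the variance is close to zero or when a single value accounts for nearly all of the observations.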

In [11]:
nearZeroVar(car)

In [12]:
names(car)[9]

In [13]:
car$engine_location
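
To make the "nearly all one value" idea concrete, here is a simplified sketch of two diagnostics of the kind `nearZeroVar` relies on: the frequency ratio of the two most common values and the percentage of unique values. The helper `nzv_stats` and the sample column are hypothetical, and the thresholds are assumed to match caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`). This is not caret's actual implementation.

```r
# Simplified sketch of the two diagnostics; not caret's actual implementation.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the sample size.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Hypothetical column: 98 "front" values and 2 "rear" values.
loc <- c(rep("front", 98), rep("rear", 2))
stats <- nzv_stats(loc)

# Flag the column when both thresholds are crossed (assumed caret defaults).
flagged <- stats[["freq_ratio"]] > 95 / 5 && stats[["pct_unique"]] < 10
```
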

## Selecting Continuous Variables

In [14]:
names(car)

In [15]:
sapply(car, is.numeric)

In [16]:
names(car)[sapply(car, is.numeric)]

In [17]:
cont.vars <- names(car)[sapply(car, is.numeric)]

In [18]:
head(car[cont.vars])

symboling,normalized_losses,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
2,164,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102,5500,24,30,13950
2,164,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115,5500,18,22,17450
1,158,105.8,192.7,71.4,55.7,2844,136,3.19,3.4,8.5,110,5500,19,25,17710
1,158,105.8,192.7,71.4,55.9,3086,131,3.13,3.4,8.3,140,5500,17,20,23875
2,192,101.2,176.8,64.8,54.3,2395,108,3.5,2.8,8.8,101,5800,23,29,16430
0,192,101.2,176.8,64.8,54.3,2395,108,3.5,2.8,8.8,101,5800,23,29,16925


## Finding Highly Correlated Variables

In [19]:
cor(car[cont.vars])

variable,symboling,normalized_losses,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
symboling,1.0,0.51834364,-0.52059057,-0.33625676,-0.219186,-0.47518487,-0.25188,-0.1094533,-0.25646934,-0.02128509,-0.13831574,-0.003949088,0.19910557,0.08954963,0.14983007,-0.1627943
normalized_losses,0.518343643,1.0,-0.06008568,0.03554071,0.1097262,-0.41370154,0.1258579,0.2078196,-0.03155814,0.06333048,-0.1272591,0.290510553,0.23769662,-0.23552348,-0.1885642,0.2027613
wheel_base,-0.520590573,-0.06008568,1.0,0.87153448,0.8149912,0.55576713,0.8101815,0.6492056,0.57815853,0.16744868,0.29143145,0.516947529,-0.28923445,-0.5806572,-0.6117499,0.7344189
length,-0.336256756,0.03554071,0.87153448,1.0,0.8383385,0.49925137,0.8712911,0.7259533,0.64631755,0.12107308,0.18481418,0.672063296,-0.23407384,-0.72454445,-0.72459867,0.7609522
width,-0.21918597,0.1097262,0.81499125,0.83833846,1.0,0.2927058,0.8705945,0.7792534,0.57255416,0.19661872,0.25875169,0.681871757,-0.23221605,-0.66668439,-0.69333851,0.8433705
height,-0.47518487,-0.41370154,0.55576713,0.49925137,0.2927058,1.0,0.3670518,0.1110826,0.25483608,-0.09131269,0.23330821,0.034317135,-0.24586416,-0.19973748,-0.22613562,0.2448363
curb_weight,-0.251879975,0.12585792,0.81018149,0.87129108,0.8705945,0.36705181,1.0,0.8886261,0.64579158,0.17384442,0.22472399,0.790095392,-0.25998788,-0.76215523,-0.78933796,0.8936391
engine_size,-0.109453297,0.20781961,0.64920558,0.72595331,0.7792534,0.1110826,0.8886261,1.0,0.59573688,0.29968307,0.14109671,0.812072626,-0.28468581,-0.69913926,-0.7140951,0.8414956
bore,-0.256469345,-0.03155814,0.57815853,0.64631755,0.5725542,0.25483608,0.6457916,0.5957369,1.0,-0.10258113,0.01511908,0.560239168,-0.31226891,-0.59044028,-0.59085039,0.5338904
stroke,-0.021285092,0.06333048,0.16744868,0.12107308,0.1966187,-0.09131269,0.1738444,0.2996831,-0.10258113,1.0,0.24358681,0.148803798,-0.01131191,-0.02005506,-0.01293438,0.1606643


In [20]:
findCorrelation(cor(car[cont.vars]), cutoff = .9)

In [21]:
cont.vars[15]
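
`findCorrelation` returns the indices of columns it suggests removing. To see which pairs drive that suggestion, one option is to list the pairs whose absolute correlation exceeds a cutoff. The sketch below does this in base R; the built-in `mtcars` data and the 0.85 cutoff are stand-ins chosen for illustration, not values from this analysis.

```r
# mtcars stands in for the car data; the 0.85 cutoff is arbitrary.
cm <- cor(mtcars)

# Keep only the upper triangle so each pair is reported once.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Row/column positions of the entries above the cutoff (NA entries are skipped).
high <- which(abs(cm) > 0.85, arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
pairs
```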

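## Centering and Scaling
 - Centering: subtracting the mean from a continuous variable
 - Scaling: dividing a continuous variable by its standard deviation

Centering and scaling put variables on a comparable scale, which can improve the performance of many machine learning methods, particularly those that are sensitive to the magnitude of their inputs.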
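
In formula form, a centered and scaled value is `z = (x - mean(x)) / sd(x)`. The following base R sketch, using hypothetical numbers, shows the computation and how the mean and standard deviation learned from one set of values can be reused on new values.

```r
# Hypothetical values standing in for a training and a test column.
train <- c(2, 4, 6, 8)
test <- c(5, 10)

# Learn the centering and scaling parameters from the training values only.
mu <- mean(train)
sigma <- sd(train)

# Apply the same parameters to both sets.
train_z <- (train - mu) / sigma
test_z <- (test - mu) / sigma

# Base R's scale() performs the same centering and scaling in one call.
train_z_scale <- as.vector(scale(train))
```

After the transformation, `train_z` has mean 0 and standard deviation 1. The test values are shifted and scaled by the training parameters, so they need not have mean 0 themselves.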

In [22]:
cs.pre = preProcess(car, method = c('center','scale'))

In [23]:
cs.pre

Created from 159 samples and 26 variables

Pre-processing:
  - centered (16)
  - ignored (10)
  - scaled (16)


In [24]:
head(predict(cs.pre, car))

symboling,normalized_losses,maker,fuel,aspiration,doors,body,wheels,engine_location,wheel_base,⋯,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
1.0595642,1.202423,audi,gas,std,four,sedan,fwd,front,0.297218,⋯,-0.3357239,mpfi,-0.4119383,0.5549495,-0.04142772,0.2006447,0.8291132,-0.4136385,-0.3222945,0.4260517
1.0595642,1.202423,audi,gas,std,four,sedan,4wd,front,0.2198099,⋯,0.5506615,mpfi,-0.4119383,0.5549495,-0.55563599,0.6238413,0.8291132,-1.397706,-1.5608401,1.0215069
0.2214015,1.034126,audi,gas,std,four,sedan,fwd,front,1.4583399,⋯,0.5506615,mpfi,-0.4119383,0.5549495,-0.42708392,0.4610734,0.8291132,-1.2336948,-1.0963855,1.0657407
0.2214015,1.034126,audi,gas,turbo,four,sedan,fwd,front,1.4583399,⋯,0.3865161,mpfi,-0.6363753,0.5549495,-0.47850475,1.437681,0.8291132,-1.5617173,-1.8704765,2.1145925
1.0595642,1.987808,bmw,gas,std,two,sedan,rwd,front,0.5681465,⋯,-0.368553,mpfi,0.7476527,-1.4797233,-0.34995268,0.1680912,1.4732289,-0.5776497,-0.4771127,0.8479742
-0.6167613,1.987808,bmw,gas,std,four,sedan,rwd,front,0.5681465,⋯,-0.368553,mpfi,0.7476527,-1.4797233,-0.34995268,0.1680912,1.4732289,-0.5776497,-0.4771127,0.9321886


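## Data Transformation
When a variable's values are not spread evenly over a range (for example 1, 10, 100, 1000, ...), a transformation that compresses the large values can improve predictive performance. Box-Cox and Yeo-Johnson are two such transformations, available in caret as the `BoxCox` and `YeoJohnson` methods. Box-Cox can only be applied to variables whose values are all greater than 0, whereas Yeo-Johnson has no such restriction.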
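
For reference, the Yeo-Johnson transformation for a fixed parameter lambda can be sketched in base R as below. The helper name `yeo_johnson` and the sample vector are hypothetical, and the sketch does not estimate lambda from the data, which is a separate step.

```r
# Yeo-Johnson transformation for a given lambda (lambda is not estimated here).
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: a Box-Cox style transform of (y + 1).
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  # Negative values: mirrored transform of (1 - y) with exponent 2 - lambda.
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

# Hypothetical skewed values, including zero and a negative number.
skewed <- c(-1, 0, 1, 10, 100, 1000)
transformed <- yeo_johnson(skewed, lambda = 0)
```

With `lambda = 1` the function returns its input unchanged, and with `lambda = 0` non-negative values are mapped to `log(y + 1)`, which pulls in the large values.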

In [25]:
tr.pre = preProcess(car, method = 'YeoJohnson')
tr.pre

Created from 159 samples and 22 variables

Pre-processing:
  - ignored (10)
  - Yeo-Johnson transformation (12)

Lambda estimates for Yeo-Johnson transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.1260 -0.6780 -0.3015 -0.1524  0.1700  1.3301 


In [26]:
head(predict(tr.pre, car))

symboling,normalized_losses,maker,fuel,aspiration,doors,body,wheels,engine_location,wheel_base,⋯,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
1.8832956,2.73449,audi,gas,std,four,sedan,fwd,front,99.8,⋯,0.8836456,mpfi,1.144342,3.4,10.0,1.703635,172.1545,2.871724,3.759902,1.291944
1.8832956,2.73449,audi,gas,std,four,sedan,4wd,front,99.4,⋯,0.8846235,mpfi,1.144342,3.4,8.0,1.71312,172.1545,2.652102,3.405769,1.292072
0.9647145,2.725423,audi,gas,std,four,sedan,fwd,front,105.8,⋯,0.8846235,mpfi,1.144342,3.4,8.5,1.709675,172.1545,2.693482,3.550559,1.29208
0.9647145,2.725423,audi,gas,turbo,four,sedan,fwd,front,105.8,⋯,0.8844744,mpfi,1.135284,3.4,8.3,1.727436,172.1545,2.608318,3.298929,1.292218
1.8832956,2.771846,bmw,gas,std,two,sedan,rwd,front,101.2,⋯,0.8835994,mpfi,1.18854,2.8,8.8,1.702829,177.0659,2.839331,3.72073,1.29204
0.0,2.771846,bmw,gas,std,four,sedan,rwd,front,101.2,⋯,0.8835994,mpfi,1.18854,2.8,8.8,1.702829,177.0659,2.839331,3.72073,1.292056


## Data Splitting

In [27]:
set.seed(1234) # fix the random seed so the results shown here are reproducible; optional in practice

In [28]:
idx <- createDataPartition(car$price, p = .8, list = FALSE, times = 1)

In [29]:
as.vector(idx)

In [30]:
train.data <- car[idx,]
test.data <- car[-idx,]

In [31]:
dim(train.data)

In [32]:
dim(test.data)

In [33]:
head(test.data)

row,symboling,normalized_losses,maker,fuel,aspiration,doors,body,wheels,engine_location,wheel_base,⋯,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
2,2,164,audi,gas,std,four,sedan,4wd,front,99.4,⋯,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
10,1,98,chevrolet,gas,std,two,hatchback,fwd,front,94.5,⋯,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6295
21,2,137,honda,gas,std,two,hatchback,fwd,front,86.6,⋯,92,1bbl,2.91,3.41,9.2,76,6000,31,38,6855
23,1,101,honda,gas,std,two,hatchback,fwd,front,93.7,⋯,92,1bbl,2.91,3.41,9.2,76,6000,30,34,6529
26,0,78,honda,gas,std,four,wagon,fwd,front,96.5,⋯,92,1bbl,2.92,3.41,9.2,76,6000,30,34,7295
28,0,106,honda,gas,std,two,hatchback,fwd,front,96.5,⋯,110,1bbl,3.15,3.58,9.0,86,5800,27,33,9095
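
For comparison, a plain random 80/20 split can be sketched in base R with `sample`. The built-in `mtcars` data is a stand-in for the car data. As far as the caret documentation describes it, `createDataPartition` differs in that it samples within groups of the outcome (percentile-based groups for a numeric outcome), so the training and test sets have similar outcome distributions; the sketch below does not do that.

```r
# Simple random split without stratification; mtcars is a stand-in dataset.
set.seed(1234)
n <- nrow(mtcars)

# Draw 80% of the row indices (rounded down) for the training set.
train_idx <- sample(n, size = floor(0.8 * n))

train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]
```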
