# Regression Capstone Exercise: Car Seat Sales
Let's predict sales of child car seats.

* The goal is to predict car seat sales for each regional store.

![](https://cdn.images.express.co.uk/img/dynamic/24/590x/child-car-seat-986556.jpg?r=1532946857754)

## 1. Environment Setup

### (1) Import

In [1]:
# Import the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

import warnings    # suppress warning messages
warnings.filterwarnings(action='ignore')

### (2) Data Loading

In [2]:
data_path = 'https://raw.githubusercontent.com/DA4BAM/dataset/master/Carseats.csv'
data = pd.read_csv(data_path)

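**Variable descriptions**

* Sales - unit sales in each region (in thousands) <== Target
* CompPrice - competitor's price in each region
* Income - average income level in each region (in thousands of dollars)
* Advertising - the company's advertising budget in each region (in thousands of dollars)
* Population - regional population (in thousands)
* Price - the company's own selling price in each region
* ShelveLoc - quality of the shelving location
* Age - average age of the regional population
* Education - education level in each region
* Urban - whether the store is in an urban area
* US - whether the store is in the US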

## 2. Data Understanding

* A quick look at the data

In [3]:
data.head()

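|   | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.5 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 3 | 7.4 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |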


## 3. Data Preparation

### (1) Data Cleanup

### (2) Data Split 1: Separating x and y

In [4]:
target = 'Sales'
x = data.drop(target, axis=1)
y = data.loc[:, target]

In [5]:
x.head()

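|   | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 3 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |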


In [6]:
y.head()

0     9.50
1    11.22
2    10.06
3     7.40
4     4.15
Name: Sales, dtype: float64

### (3) Handling Missing Values (NA)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    object 
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    object 
 10  US           400 non-null    object 
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB
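
`data.info()` reports 400 non-null entries in every column, so nothing needs to be imputed here. A more direct check is to count missing values per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Carseats data) so that the counts are non-zero:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Price": [120.0, np.nan, 80.0],
    "ShelveLoc": ["Bad", "Good", None],
})

# Number of missing values per column
na_counts = toy.isna().sum()
print(na_counts)

# True if any value anywhere in the frame is missing
has_na = bool(toy.isna().any().any())
print(has_na)
```

On the notebook's data the equivalent call would be `data.isna().sum()`.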


### (4) Dummy Variable Encoding

In [8]:
# Urban
dumm_Urban = pd.get_dummies(x['Urban'], drop_first=True, prefix='Urban')

In [9]:
dumm_Urban.head()

|   | Urban_Yes |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |


In [10]:
# US
dumm_US = pd.get_dummies(x['US'], drop_first=True, prefix='US')

In [11]:
dumm_US.head()

|   | US_Yes |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |


In [12]:
# ShelveLoc
dumm_ShelveLoc = pd.get_dummies(x['ShelveLoc'], drop_first=True, prefix='ShelveLoc')

In [13]:
# Merge the dummy columns into x
x = pd.concat([x, dumm_Urban, dumm_US, dumm_ShelveLoc], axis=1)

In [14]:
x.head()

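|   | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | Urban_Yes | US_Yes | ShelveLoc_Good | ShelveLoc_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes | 1 | 1 | 0 | 0 |
| 1 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes | 1 | 1 | 1 | 0 |
| 2 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes | 1 | 1 | 0 | 1 |
| 3 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes | 1 | 1 | 0 | 1 |
| 4 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No | 1 | 0 | 0 | 0 |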


In [15]:
# Drop the original categorical columns
x.drop(['Urban', 'US', 'ShelveLoc'], axis=1, inplace=True)

In [16]:
x.head()

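|   | CompPrice | Income | Advertising | Population | Price | Age | Education | Urban_Yes | US_Yes | ShelveLoc_Good | ShelveLoc_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | 42 | 17 | 1 | 1 | 0 | 0 |
| 1 | 111 | 48 | 16 | 260 | 83 | 65 | 10 | 1 | 1 | 1 | 0 |
| 2 | 113 | 35 | 10 | 269 | 80 | 59 | 12 | 1 | 1 | 0 | 1 |
| 3 | 117 | 100 | 4 | 466 | 97 | 55 | 14 | 1 | 1 | 0 | 1 |
| 4 | 141 | 64 | 3 | 340 | 128 | 38 | 13 | 1 | 0 | 0 | 0 |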
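
The three `get_dummies` calls, the `concat`, and the `drop` in this section can also be done in one call by passing the `columns` argument. Here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are hypothetical, and `dtype=int` is passed because recent pandas versions otherwise return boolean dummy columns instead of the 0/1 integers shown in the outputs here:

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns as the notebook
toy = pd.DataFrame({
    "Price": [120, 83, 80],
    "ShelveLoc": ["Bad", "Good", "Medium"],
    "Urban": ["Yes", "Yes", "No"],
    "US": ["Yes", "No", "Yes"],
})

cat_cols = ["Urban", "US", "ShelveLoc"]

# Encode the listed columns, drop the first level of each,
# and remove the original categorical columns in one step
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The untouched columns come first, followed by the dummy columns in the order given in `cat_cols`.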


### (5) Data Split 2: Train/Validation Split

In [17]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=2022)

### (6) Scaling
KNN is distance-based, so the features must be scaled before applying it.

In [18]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# MinMax
minmax = MinMaxScaler()
x_train_s1 = minmax.fit_transform(x_train)
x_val_s1 = minmax.transform(x_val)

# Standard
standard = StandardScaler()
x_train_s2 = standard.fit_transform(x_train)
x_val_s2 = standard.transform(x_val)
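
To see why scaling matters for a distance-based method, consider the sketch below. It uses four made-up stores with one large-valued feature and one 0/1 feature (all values are hypothetical) and finds the nearest neighbor of the first store before and after min-max scaling:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stores: columns are [Population (thousands), ShelveLoc_Good (0/1)]
stores = np.array([
    [100.0, 0.0],   # store 0: the query point
    [120.0, 0.0],   # store 1: same shelf quality, slightly larger population
    [102.0, 1.0],   # store 2: different shelf quality, almost the same population
    [500.0, 1.0],   # store 3: far away on both features
])

def nearest_to_first(points):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(stores)
scaled_nearest = nearest_to_first(MinMaxScaler().fit_transform(stores))
print(raw_nearest, scaled_nearest)
```

On the raw values the population column dominates the distance, so store 2 is nearest even though its shelf quality differs. After scaling, both features span the same 0 to 1 range and store 1 becomes the nearest neighbor.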

## 4. Modeling: Linear Regression

* Build at least two models while varying the inputs, then predict and evaluate with each.

In [19]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

* Model 1

In [20]:
# minmax
model1 = LinearRegression()
model1.fit(x_train_s1, y_train)
pred1 = model1.predict(x_val_s1)

* Model 2

In [21]:
# standard
model2 = LinearRegression()
model2.fit(x_train_s2, y_train)
pred2 = model2.predict(x_val_s2)

In [22]:
# RMSE (np.sqrt is used because `squared=False` is deprecated since scikit-learn 1.4)
print(np.sqrt(mean_squared_error(y_val, pred1)),
      np.sqrt(mean_squared_error(y_val, pred2)))

1.0241647217116252 1.0241647217116252


In [23]:
# MAE
print(mean_absolute_error(y_val, pred1), 
      mean_absolute_error(y_val, pred2))

0.8163852580798289 0.8163852580798289


In [24]:
# MAPE
print(mean_absolute_percentage_error(y_val, pred1), 
      mean_absolute_percentage_error(y_val, pred2))

0.21169295202880933 0.21169295202880933


In [25]:
# 1 - MAPE
print(1 - mean_absolute_percentage_error(y_val, pred1), 
      1 - mean_absolute_percentage_error(y_val, pred2))

0.7883070479711907 0.7883070479711907
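
Models 1 and 2 score identically on every metric. That is expected rather than a bug. For ordinary least squares with an intercept and non-collinear features, rescaling each feature by a linear map changes the coefficients but not the predictions, up to floating-point error. The sketch below checks this on synthetic data (the random data, seed, and split are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, plus a noisy linear target
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y = X @ np.array([2.0, -0.1, 0.003]) + rng.normal(scale=0.5, size=200)

X_train, X_val = X[:150], X[150:]
y_train = y[:150]

preds = {}
for name, scaler in [("minmax", MinMaxScaler()), ("standard", StandardScaler())]:
    X_train_s = scaler.fit_transform(X_train)   # fit the scaler on train only
    X_val_s = scaler.transform(X_val)           # reuse the train statistics
    preds[name] = LinearRegression().fit(X_train_s, y_train).predict(X_val_s)

# Predictions agree up to floating-point error
same = bool(np.allclose(preds["minmax"], preds["standard"]))
print(same)
```

So for plain linear regression the choice of scaler is not a meaningful way to produce a second model. Changing the feature set would be.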


## 5. Modeling: KNN

* Build at least three models with different hyperparameters.

* Model 3

In [26]:
# default
model3 = KNeighborsRegressor()
model3.fit(x_train_s1, y_train)
pred3 = model3.predict(x_val_s1)

* Model 4

In [27]:
model4 = KNeighborsRegressor(n_neighbors = 10, metric = 'euclidean')
model4.fit(x_train_s1, y_train)
pred4 = model4.predict(x_val_s1)

* Model 5

In [28]:
model5 = KNeighborsRegressor(n_neighbors = 10, metric = 'manhattan')
model5.fit(x_train_s1, y_train)
pred5 = model5.predict(x_val_s1)
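
Rather than hand-picking a few settings, `n_neighbors` can be explored by looping over candidate values and recording the validation RMSE for each. The sketch below does this on synthetic data. The dataset, candidate list, and seeds are assumptions; in this notebook the arrays would be `x_train_s1`, `y_train`, `x_val_s1`, and `y_val`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the car seat data
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale using statistics from the training split only
scaler = MinMaxScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_va_s = scaler.transform(X_va)

# Validation RMSE for each candidate k
rmse_by_k = {}
for k in [1, 3, 5, 10, 20, 40]:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_tr_s, y_tr)
    pred = model.predict(X_va_s)
    rmse_by_k[k] = float(np.sqrt(mean_squared_error(y_va, pred)))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, best_k)
```

A `k` chosen this way is tuned to the validation set, so its validation score is likely somewhat optimistic.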

## 6. Performance Comparison

In [29]:
# RMSE (np.sqrt is used because `squared=False` is deprecated since scikit-learn 1.4)
print(np.sqrt(mean_squared_error(y_val, pred3)),
      np.sqrt(mean_squared_error(y_val, pred4)),
      np.sqrt(mean_squared_error(y_val, pred5)))

2.45017136135414 2.4012799625061074 2.314530553553643


In [30]:
# MAE
print(mean_absolute_error(y_val, pred3), 
      mean_absolute_error(y_val, pred4),
      mean_absolute_error(y_val, pred5))

2.0191166666666667 1.9576583333333333 1.9036


In [31]:
# MAPE
print(mean_absolute_percentage_error(y_val, pred3), 
      mean_absolute_percentage_error(y_val, pred4),
      mean_absolute_percentage_error(y_val, pred5))

0.7126276869776597 0.7217472786677768 0.6958516050724891


In [32]:
# 1 - MAPE
print(1 - mean_absolute_percentage_error(y_val, pred3), 
      1 - mean_absolute_percentage_error(y_val, pred4),
      1 - mean_absolute_percentage_error(y_val, pred5))

0.28737231302234034 0.27825272133222323 0.3041483949275109
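
Printing each metric in its own cell makes it hard to compare models side by side. One option is a small helper that gathers the metrics into a single table. In the sketch below, `evaluate` is a hypothetical helper name and the arrays are made-up values for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

def evaluate(y_true, preds):
    """Return a table with one row of RMSE / MAE / MAPE per named prediction."""
    rows = {}
    for name, pred in preds.items():
        rows[name] = {
            "RMSE": float(np.sqrt(mean_squared_error(y_true, pred))),
            "MAE": float(mean_absolute_error(y_true, pred)),
            "MAPE": float(mean_absolute_percentage_error(y_true, pred)),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up values for illustration only
y_true = np.array([10.0, 8.0, 5.0, 4.0])
preds = {
    "perfect": np.array([10.0, 8.0, 5.0, 4.0]),
    "off_by_one": np.array([11.0, 9.0, 6.0, 5.0]),
}

table = evaluate(y_true, preds)
print(table)
```

In this notebook the call would look like `evaluate(y_val, {"model1": pred1, "model3": pred3, "model4": pred4, "model5": pred5})`, which puts the linear and KNN results in one table.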
