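<Part2 - Data Analysis (Regression)>

Models
- Simple/multiple linear regression
  - In linear regression, we predict on both the train and test sets to check whether the model is overfitting or generalizes well
- Decision tree regression
- Logistic regression
- Random forest

Workflow
- Import the required packages
- Load the data and run a quick EDA
- Preprocess the data
- Split into train/test -> run the analysis
- Evaluate performance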

## Simple Linear Regression - auto-mpg
- *y = ax + b*


### 1. Import packages

In [103]:
import pandas as pd
import numpy as np
import sklearn

In [104]:
# Package for running linear regression
from sklearn.linear_model import LinearRegression

# data split
from sklearn.model_selection import train_test_split

### 2. Load data + EDA
auto-mpg.csv
- Predict automobile fuel efficiency
- mpg : fuel efficiency (miles per gallon)

In [105]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/auto-mpg.csv')

In [106]:
df.head()

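|  | mpg | cylinders | displacement | horsepower | weight | acceleration | model-year |
|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 |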


In [107]:
df.info() ## horsepower has 2 missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    396 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model-year    398 non-null    int64  
dtypes: float64(4), int64(3)
memory usage: 21.9 KB


In [108]:
df.describe()

|  | mpg | cylinders | displacement | horsepower | weight | acceleration | model-year |
|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 396.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.189394 | 2970.424623 | 15.56809 | 76.01005 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.40203 | 846.841774 | 2.757689 | 3.697627 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 |
| 50% | 23.0 | 4.0 | 148.5 | 92.0 | 2803.5 | 15.5 | 76.0 |
| 75% | 29.0 | 8.0 | 262.0 | 125.0 | 3608.0 | 17.175 | 79.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 |


In [109]:
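# Check the pairwise correlations between variables
corr = df.corr(method="pearson")
print(corr)
## displacement (-0.80) and weight (-0.83) have the strongest negative correlations with mpg
## |r| >= 0.7 is commonly treated as a strong correlation; every variable shows at least a moderate correlation with mpg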

                   mpg  cylinders  displacement  horsepower    weight  \
mpg           1.000000  -0.775396     -0.804203   -0.777575 -0.831741   
cylinders    -0.775396   1.000000      0.950721    0.843751  0.896017   
displacement -0.804203   0.950721      1.000000    0.897787  0.932824   
horsepower   -0.777575   0.843751      0.897787    1.000000  0.864350   
weight       -0.831741   0.896017      0.932824    0.864350  1.000000   
acceleration  0.420289  -0.505419     -0.543684   -0.687241 -0.417457   
model-year    0.579267  -0.348746     -0.370164   -0.420697 -0.306564   

              acceleration  model-year  
mpg               0.420289    0.579267  
cylinders        -0.505419   -0.348746  
displacement     -0.543684   -0.370164  
horsepower       -0.687241   -0.420697  
weight           -0.417457   -0.306564  
acceleration      1.000000    0.288137  
model-year        0.288137    1.000000  
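
To read the matrix more easily, one option is to pull out only the `mpg` column and sort it. The following is a minimal sketch on a small made-up DataFrame (the values in `toy` are illustrative, not rows from auto-mpg.csv), applying the same 0.7 rule of thumb mentioned above:

```python
import pandas as pd

# Illustrative toy data (not the real auto-mpg rows)
toy = pd.DataFrame({
    "mpg":          [30.0, 25.0, 20.0, 15.0, 10.0],
    "weight":       [2000, 2500, 3000, 3600, 4200],
    "acceleration": [18.0, 16.5, 16.0, 13.0, 12.0],
})

# Correlation of every feature with the target, sorted from most negative to most positive
corr_with_target = toy.corr(method="pearson")["mpg"].drop("mpg").sort_values()
print(corr_with_target)

# Features whose absolute correlation passes the 0.7 rule of thumb
strong = corr_with_target[corr_with_target.abs() >= 0.7].index.tolist()
print(strong)
```

Sorting by the target column makes it easier to pick a single predictor for simple regression than scanning the full matrix.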


### 3. Data preprocessing

In [110]:
# Drop rows with missing values
# horsepower has 2 missing values -> drop those rows
df = df.dropna(axis = 0)
df.info()

# Correlations barely change after dropping the missing rows
# corr = df.corr(method="pearson")
# print(corr)

<class 'pandas.core.frame.DataFrame'>
Index: 396 entries, 0 to 397
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           396 non-null    float64
 1   cylinders     396 non-null    int64  
 2   displacement  396 non-null    float64
 3   horsepower    396 non-null    float64
 4   weight        396 non-null    int64  
 5   acceleration  396 non-null    float64
 6   model-year    396 non-null    int64  
dtypes: float64(4), int64(3)
memory usage: 24.8 KB


### 4. Prepare the dataset and fit the model

In [111]:
X = df[['weight']] ## must be extracted as 2-D (a DataFrame, not a Series)
y = df['mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=10)

In [112]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(316, 1)
(80, 1)
(316,)
(80,)
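
The `(316, 1)` versus `(316,)` shapes above come from the double-bracket selection used for `X`. As a small sketch on a three-row toy frame (the `toy` name is an assumption for illustration), this shows the difference and one way to reshape a Series into the 2-D layout scikit-learn estimators expect:

```python
import pandas as pd

toy = pd.DataFrame({"weight": [3504, 3693, 3436], "mpg": [18.0, 15.0, 18.0]})

X_2d = toy[["weight"]]   # double brackets -> DataFrame, shape (n_samples, 1)
X_1d = toy["weight"]     # single brackets -> Series, shape (n_samples,)

print(X_2d.shape)  # (3, 1)
print(X_1d.shape)  # (3,)

# A Series can be turned into the 2-D layout with reshape(-1, 1)
X_from_series = X_1d.to_numpy().reshape(-1, 1)
print(X_from_series.shape)  # (3, 1)
```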


In [113]:
# model fitting
lr = LinearRegression()
lr.fit(X_train, y_train)

In [116]:
# Derive the regression equation
# LinearRegression has no summary(), so read the coefficients directly
print("coef_:", lr.coef_)
print("intercept_:", lr.intercept_)
## y = -0.00774371x + 46.62501834798047

coef_: [-0.00774371]
intercept_: 46.62501834798047
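
As a quick check of what the fitted equation means, the coefficient and intercept printed above can be plugged into *y = ax + b* by hand. The function name `predict_mpg` is a hypothetical helper for illustration; the input 3504 is the weight of the first row shown by `df.head()`, whose actual mpg is 18.0:

```python
# Coefficient and intercept copied from the output above
COEF = -0.00774371
INTERCEPT = 46.62501834798047

def predict_mpg(weight):
    """Evaluate y = a*x + b for a single weight value."""
    return COEF * weight + INTERCEPT

# Weight of the first row shown by df.head()
print(round(predict_mpg(3504), 2))  # 19.49
```

This is the same calculation `lr.predict` performs for a single-feature model, up to the rounding of the printed coefficient.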


In [117]:
pred = lr.predict(X_test)

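### 5. Performance evaluation
- In linear regression, a higher R^2 indicates better predictive power
- Use r2_score() from sklearn.metrics

In [94]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, pred)
print("r^2 : ", r2) ## R^2 of about 0.70 on the test set (explained variance, not accuracy)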

r^2 :  0.7015633872576372
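
For intuition about the number above, R^2 can be written as 1 - SS_res / SS_tot. The sketch below implements that definition in plain Python (the `r_squared` helper and the sample values are illustrative assumptions): perfect predictions give 1, and always predicting the mean gives 0.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [18.0, 15.0, 18.0, 16.0, 17.0]       # illustrative values
perfect = r_squared(y_true, y_true)           # predictions equal the targets
baseline = r_squared(y_true, [16.8] * 5)      # always predict the mean (16.8)
print(perfect, baseline)
```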


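- The same evaluation can also be run on the training data  
-> in that case, predict on the training set first

In [95]:
# predict
pred = lr.predict(X_train)

# Evaluate model performance
r2_x = r2_score(y_train, pred)
print("r2 for train set: ", r2_x) ## train R^2 of about 0.69, close to the test R^2 (0.70), so no sign of overfitting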

r2 for train set:  0.6875735975346924
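
The train/test comparison can be wrapped in one helper so that both scores are always computed together. This is a sketch on synthetic data, not the auto-mpg run: the `train_test_r2` name and the generated `X`, `y` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def train_test_r2(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return (train R^2, test R^2)."""
    model.fit(X_train, y_train)
    return (r2_score(y_train, model.predict(X_train)),
            r2_score(y_test, model.predict(X_test)))

# Synthetic stand-in data: one feature with a noisy linear relationship
rng = np.random.default_rng(10)
X = rng.uniform(1600, 5100, size=(400, 1))
y = 46.0 - 0.0077 * X[:, 0] + rng.normal(scale=4.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
r2_train, r2_test = train_test_r2(LinearRegression(), X_train, X_test, y_train, y_test)
print(f"train R^2: {r2_train:.3f}, test R^2: {r2_test:.3f}")
```

A train score far above the test score would point to overfitting; similar scores suggest the model generalizes.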


Repeat the same linear regression for other variables
- horsepower


In [118]:
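# Split the dataset
X = df[['horsepower']]
y = df['mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
print(X_train.shape)

# Fit the model
lr2 = LinearRegression()
lr2.fit(X_train, y_train)

# Extract the regression coefficients
coef = lr2.coef_[0]
intercept = lr2.intercept_

print(f"Regression equation : mpg = {coef:.4f} * x + {intercept:.4f}")

# predict with the horsepower model
# (the output recorded below came from a run with this line commented out, so r2_score
#  reused `pred` from the weight model and printed that model's R^2; rerun for the horsepower score)
pred = lr2.predict(X_test)

# Evaluate the model
from sklearn.metrics import r2_score
r2_lr2 = r2_score(y_test, pred)
print(r2_lr2)

(316, 1)
Regression equation : mpg = -0.1604 * x + 40.3134
0.7015633872576372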


- displacement

In [119]:
# Split the dataset
X = df[['displacement']]
y = df['mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# Fit the model
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

# Extract the regression coefficients
coef = lr.coef_[0]
intercept = lr.intercept_
print(f"Regression equation : mpg = {coef:.4f} * x + {intercept:.4f}")

# predict
pred = lr.predict(X_test)

# Evaluate the model
from sklearn.metrics import r2_score
r2 = r2_score(y_test, pred)
print("r2: ", r2)

Regression equation : mpg = -0.0618 * x + 35.4934
r2:  0.6094679140387653
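
Because the same split, fit, predict and score steps repeat for each feature, one option is to wrap them in a function so that every feature gets its own predictions. The sketch below uses a synthetic stand-in for the cleaned auto-mpg frame; `fit_simple_regression` and the generated `toy` data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_simple_regression(df, feature, target="mpg", test_size=0.2, random_state=10):
    """Fit target ~ feature and return (coef, intercept, test R^2)."""
    X = df[[feature]]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)          # local to this call, so it cannot be stale
    return model.coef_[0], model.intercept_, r2_score(y_test, pred)

# Synthetic stand-in for the cleaned auto-mpg frame
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5100, size=300)
toy = pd.DataFrame({
    "weight": weight,
    "horsepower": 0.03 * weight + rng.normal(scale=15, size=300),
    "mpg": 46.0 - 0.0077 * weight + rng.normal(scale=3, size=300),
})

results = {f: fit_simple_regression(toy, f) for f in ["weight", "horsepower"]}
for feature, (coef, intercept, r2) in results.items():
    print(f"{feature}: mpg = {coef:.4f} * x + {intercept:.4f}, R^2 = {r2:.3f}")
```

Keeping `pred` inside the function avoids scoring one model with another model's predictions.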


## Multiple Linear Regression - housing.csv
- Multiple linear regression
- Each independent variable should be tested for significance
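
scikit-learn's `LinearRegression` does not report p-values, so the significance check has to be done separately. As a rough sketch (not the notebook's method), the t-statistic of each coefficient can be computed with NumPy from the ordinary least squares formulas; the `ols_t_statistics` helper and the synthetic data are assumptions for illustration. For large samples, |t| above roughly 2 corresponds to significance at about the 5% level.

```python
import numpy as np

def ols_t_statistics(X, y):
    """Return OLS coefficients (intercept first) and their t-statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])      # add the intercept column
    p = design.shape[1]
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y                  # least squares estimate
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - p)       # unbiased residual variance
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard error per coefficient
    return beta, beta / std_err

# Synthetic data: y depends on x_signal only, x_noise is unrelated
rng = np.random.default_rng(0)
x_signal = rng.normal(size=200)
x_noise = rng.normal(size=200)
y = 3.0 * x_signal + rng.normal(scale=0.5, size=200)

beta, t_stats = ols_t_statistics(np.column_stack([x_signal, x_noise]), y)
print("coefficients:", beta)
print("t-statistics:", t_stats)
```

Statistical packages such as statsmodels report these statistics with p-values directly, which is usually more convenient than computing them by hand.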

### 1. Import required packages

In [120]:
import pandas as pd
import numpy as np
import sklearn

# Linear regression model
from sklearn.linear_model import LinearRegression

# split
from sklearn.model_selection import train_test_split

### 2. Load data + EDA

In [121]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/housing.csv')

In [123]:
df.info() ## total_bedrooms has missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [124]:
df.head()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


### 3. Data preprocessing
- ocean_proximity : proximity to the ocean -> excluded because it is string data
- total_bedrooms : drop rows with missing values

In [126]:
df['ocean_proximity'].unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [127]:
# Drop rows with missing values
df = df.dropna(axis = 0)

# Exclude ocean_proximity from the analysis
df = df.drop("ocean_proximity", axis=1)

In [128]:
# Check correlations
df.corr(method="pearson")

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0 | -0.924616 | -0.109357 | 0.04548 | 0.069608 | 0.10027 | 0.056513 | -0.01555 | -0.045398 |
| latitude | -0.924616 | 1.0 | 0.011899 | -0.036667 | -0.066983 | -0.108997 | -0.071774 | -0.079626 | -0.144638 |
| housing_median_age | -0.109357 | 0.011899 | 1.0 | -0.360628 | -0.320451 | -0.295787 | -0.302768 | -0.118278 | 0.106432 |
| total_rooms | 0.04548 | -0.036667 | -0.360628 | 1.0 | 0.93038 | 0.857281 | 0.918992 | 0.197882 | 0.133294 |
| total_bedrooms | 0.069608 | -0.066983 | -0.320451 | 0.93038 | 1.0 | 0.877747 | 0.979728 | -0.007723 | 0.049686 |
| population | 0.10027 | -0.108997 | -0.295787 | 0.857281 | 0.877747 | 1.0 | 0.907186 | 0.005087 | -0.0253 |
| households | 0.056513 | -0.071774 | -0.302768 | 0.918992 | 0.979728 | 0.907186 | 1.0 | 0.013434 | 0.064894 |
| median_income | -0.01555 | -0.079626 | -0.118278 | 0.197882 | -0.007723 | 0.005087 | 0.013434 | 1.0 | 0.688355 |
| median_house_value | -0.045398 | -0.144638 | 0.106432 | 0.133294 | 0.049686 | -0.0253 | 0.064894 | 0.688355 | 1.0 |


### 4. Prepare the dataset and fit the model
- Predict median_house_value

In [129]:
# Split the dataset
X = df.drop("median_house_value", axis=1)
y = df['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(14303, 8)
(6130, 8)
(14303,)
(6130,)


In [130]:
lr = LinearRegression()
lr.fit(X_train, y_train)

coef = lr.coef_
intercept = lr.intercept_
print(coef)
print(intercept)

[-4.21262308e+04 -4.20623763e+04  1.18784999e+03 -8.57874086e+00
  1.18123421e+02 -3.55751755e+01  3.73676747e+01  4.03297253e+04]
-3530241.307796566
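
The printed coefficient array follows the column order of `X`, so the eight values above correspond to longitude through median_income. One way to make that mapping explicit is to wrap `coef_` in a pandas Series indexed by the column names. The sketch below uses synthetic two-feature data as a stand-in; the generated values are assumptions, only the column names are borrowed from housing.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in with two named features
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=500),
    "housing_median_age": rng.uniform(1.0, 52.0, size=500),
})
y = (40000.0 * X["median_income"]
     + 1000.0 * X["housing_median_age"]
     + rng.normal(scale=5000.0, size=500))

lr = LinearRegression().fit(X, y)

# Pair each coefficient with its column name, in the order of X.columns
coef_by_feature = pd.Series(lr.coef_, index=X.columns)
print(coef_by_feature)
print("intercept:", lr.intercept_)
```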


In [131]:
# predict
pred = lr.predict(X_test)

### 5. Performance evaluation

In [132]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, pred)
print(r2)

0.6445130291082337


In [133]:
# Also run the prediction on the training set
pred = lr.predict(X_train)

r2_tr = r2_score(y_train, pred)
print(r2_tr)

0.6334125389213838


## DecisionTree Regression - housing.csv
- Carries a risk of overfitting, but is still worth trying
- Evaluated with MSE (Mean Squared Error); the lower the MSE, the better the predictions
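
To make the overfitting risk concrete, the sketch below compares a depth-limited tree with an unrestricted one on synthetic noisy data (a stand-in, not housing.csv; the data generation is an assumption for illustration). The unrestricted tree memorizes the training set, so its train MSE drops to essentially zero while its test MSE stays well above that.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (a stand-in, not housing.csv)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(600, 2))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=2.0, size=600)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mse_by_depth = {}
for depth in (3, None):   # None lets the tree grow until the leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    mse_by_depth[depth] = (
        mean_squared_error(y_train, tree.predict(X_train)),   # train MSE
        mean_squared_error(y_test, tree.predict(X_test)),     # test MSE
    )
    print(depth, mse_by_depth[depth])
```

Limiting `max_depth` trades a higher train error for a smaller gap between train and test error.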

### 1. Import required packages
- Basic packages omitted

In [161]:
# DecisionTreeRegressor is in sklearn.tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

### 2. Load data and EDA

In [185]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/housing.csv')

In [186]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


### 3. Data preprocessing

In [187]:
# Drop rows with missing values
df = df.dropna(axis=0)

In [188]:
# Exclude the ocean_proximity variable
df = df.drop("ocean_proximity", axis = 1)

In [189]:
# Check correlations
df.corr(method="pearson")

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0 | -0.924616 | -0.109357 | 0.04548 | 0.069608 | 0.10027 | 0.056513 | -0.01555 | -0.045398 |
| latitude | -0.924616 | 1.0 | 0.011899 | -0.036667 | -0.066983 | -0.108997 | -0.071774 | -0.079626 | -0.144638 |
| housing_median_age | -0.109357 | 0.011899 | 1.0 | -0.360628 | -0.320451 | -0.295787 | -0.302768 | -0.118278 | 0.106432 |
| total_rooms | 0.04548 | -0.036667 | -0.360628 | 1.0 | 0.93038 | 0.857281 | 0.918992 | 0.197882 | 0.133294 |
| total_bedrooms | 0.069608 | -0.066983 | -0.320451 | 0.93038 | 1.0 | 0.877747 | 0.979728 | -0.007723 | 0.049686 |
| population | 0.10027 | -0.108997 | -0.295787 | 0.857281 | 0.877747 | 1.0 | 0.907186 | 0.005087 | -0.0253 |
| households | 0.056513 | -0.071774 | -0.302768 | 0.918992 | 0.979728 | 0.907186 | 1.0 | 0.013434 | 0.064894 |
| median_income | -0.01555 | -0.079626 | -0.118278 | 0.197882 | -0.007723 | 0.005087 | 0.013434 | 1.0 | 0.688355 |
| median_house_value | -0.045398 | -0.144638 | 0.106432 | 0.133294 | 0.049686 | -0.0253 | 0.064894 | 0.688355 | 1.0 |


### 4. Prepare the dataset and fit the model

In [190]:
# Split the dataset
X = df.drop("median_house_value", axis = 1)
y = df['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [179]:
# Create the DecisionTreeRegressor model object
dtr = DecisionTreeRegressor(max_depth = 3, random_state=42)
dtr.fit(X_train, y_train)

In [180]:
pred = dtr.predict(X_test)
print(pred)

[201923.31611709 201923.31611709 201923.31611709 ... 115608.30781105
 201923.31611709 158373.33035144]


### 5. Performance evaluation

In [181]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print("mse: ", mse) ## the lower the MSE, the better the performance

mse:  7056833754.187382
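
An MSE in the billions is hard to read because it is in squared units of the target. Taking the square root (RMSE) puts the error back in the units of median_house_value. A small sketch using the test MSE printed above:

```python
import math

# Test-set MSE copied from the output above
mse = 7056833754.187382

# RMSE is in the same units as median_house_value, so it is easier to interpret
rmse = math.sqrt(mse)
print(round(rmse))  # 84005
```

So the depth-3 tree's typical test error is on the order of 84,000 in the target's units.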


In [182]:
# Performance evaluation on the train set
pred = dtr.predict(X_train)
mse = mean_squared_error(y_train, pred)
print("mse: ", mse)

mse:  6636838236.287059


## RandomForestRegression
- Likely to perform best, so on the exam the default pick is random forest, no questions asked
- Often gives good results with default parameter settings alone


### 1. Import required packages

In [183]:
import pandas as pd
import numpy as np
import sklearn

# Package for RandomForestRegression
from sklearn.ensemble import RandomForestRegressor

# data split
from sklearn.model_selection import train_test_split

### 2. Load the data

In [184]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/housing.csv')

### 3. Preprocessing
- Same as for the DecisionTreeRegression model (drop rows with missing values, then drop ocean_proximity)
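
This section has no preprocessing cell of its own, but the steps from the decision tree section have to be applied to the freshly loaded frame before fitting. A minimal sketch of those two steps as one function, tested on a tiny illustrative frame (`preprocess_housing` and `toy` are hypothetical names, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def preprocess_housing(df):
    """Drop rows with missing values, then drop the string column."""
    cleaned = df.dropna(axis=0)
    return cleaned.drop("ocean_proximity", axis=1)

# Tiny illustrative frame with the same kinds of columns as housing.csv
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600.0, 358500.0, 352100.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY"],
})
clean = preprocess_housing(toy)
print(clean.shape)  # (2, 3)
```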

In [192]:
df.head()

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 |


### 4. Prepare the dataset and fit the model

In [193]:
X = df.drop("median_house_value", axis =1)
y = df['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [195]:
# modeling
# note: max_depth=3 limits each tree; the default (max_depth=None) grows the trees fully
rf = RandomForestRegressor(max_depth=3, random_state=42)
rf.fit(X_train, y_train)

In [196]:
# predict
pred = rf.predict(X_test)
print(pred)

[185613.12090302 200380.41589374 202368.90900627 ... 199257.87619654
 118624.34177765 118624.34177765]


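### 5. Performance evaluation
- The MSE is lower than for the decision tree regression model above
- This suggests the random forest performs better, although the two runs use different test splits (test_size 0.2 vs 0.3), so the comparison is only approximate

In [197]:
# Evaluate the random forest with MSE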
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print(mse)

6447828605.376922


In [199]:
pred = rf.predict(X_train)
mse = mean_squared_error(y_train, pred)
print("mse: ", mse)

mse:  6342421033.759215
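
For a like-for-like comparison, both models should be scored on exactly the same train/test rows. The sketch below does this on synthetic data from `make_regression` (a stand-in for housing.csv; the sample size and noise level are assumptions for illustration), using the same `max_depth=3` and `random_state=42` as the cells above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data as a stand-in for housing.csv
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "tree": DecisionTreeRegressor(max_depth=3, random_state=42),
    "forest": RandomForestRegressor(max_depth=3, random_state=42),
}

# Both models see exactly the same train/test rows, so their MSEs are comparable
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(name, test_mse[name])
```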
