## 1. Data preparation
- Load the data
- Set the features X and the target labels y

#### ●Load the data

In [73]:
# Import the required libraries
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.model_selection import train_test_split,KFold
import pandas as pd
from IPython.display import display


# Read the training data from the CSV file
df_data = pd.read_csv('data.csv', header=0, quotechar='"', encoding='cp932')

# Confirm that the CSV file was read correctly
display(df_data.head(5), df_data.shape)

# Read the score data (the rows to predict) from the CSV file
df_score = pd.read_csv('score.csv', header=0, quotechar='"', encoding='cp932')

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


(9532, 16)
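
The `encoding='cp932'` and `quotechar='"'` arguments matter for files like this one, where some fields (for example `Market Category`) contain commas inside quotes. The following is a minimal sketch of that round trip, assuming a made-up two-row sample; the file name `sample_cp932.csv` and the sample values are hypothetical and are not taken from `data.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; the real data.csv is not reproduced here.
sample = pd.DataFrame({
    "Make": ["BMW", "トヨタ"],
    "Market Category": ["Luxury,Performance", "Hatchback"],
    "MSRP": [40650, 21000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_cp932.csv")
    # Write with the same encoding the notebook reads with.
    sample.to_csv(path, index=False, encoding="cp932")
    # quotechar='"' keeps the comma inside "Luxury,Performance" in one field.
    loaded = pd.read_csv(path, header=0, quotechar='"', encoding="cp932")

print(loaded.shape)
print(loaded)
```

If the encoding passed to `read_csv` does not match the file, non-ASCII text is either garbled or raises a decoding error, so checking `head()` and `shape` right after loading, as the cell above does, is a cheap safeguard.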

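#### ●Set the features X and the target labels y

In [3]:
# Set the features X and the target labels y
# (.copy() so that columns added later do not write into df_data / df_score)
X = df_data.iloc[:, 0:-1].copy()
y = df_data.iloc[:, -1]
X_score = df_score.iloc[:, :].copy()

# Confirm that X and y were set correctly
display(X.head(5), X.shape)
display(y.head(5), y.shape)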

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916


(9532, 15)

0    46135
1    40650
2    36350
3    29450
4    34500
Name: MSRP, dtype: int64

(9532,)
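
The positional split above relies on the target being the last column. A small sketch of the same idea on a toy frame follows; the toy values and the `Age` column are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame whose last column is the target, as in df_data.
toy = pd.DataFrame({
    "Year": [2011, 2015, 2017],
    "Engine HP": [335.0, 300.0, 230.0],
    "MSRP": [46135, 40650, 29450],
})

# All columns except the last become features; the last column is the target.
X_toy = toy.iloc[:, :-1].copy()
y_toy = toy.iloc[:, -1]

# Adding a column to the copy leaves the source frame unchanged.
X_toy["Age"] = 2019 - X_toy["Year"]

print(X_toy.shape, y_toy.shape, list(toy.columns))
```

Selecting by name (for example `toy.drop(columns="MSRP")`) is a possible alternative when the column order of the input file is not guaranteed.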

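## 2. Data processing
- Check for missing values
- Check the summary statistics
- Create a new feature from Year
- Impute missing values
- Select the features to use

#### ●Check for missing values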

In [4]:
display(df_data.isnull().sum())

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              53
Engine Cylinders       22
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      2999
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64
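
Raw counts are easier to judge next to the share of rows they represent; in the output above, `Market Category` is missing in 2999 of 9532 rows, roughly 31%. A minimal sketch, on a hypothetical toy frame, of reporting both the count and the ratio:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps.
toy = pd.DataFrame({
    "Engine HP": [335.0, np.nan, 230.0, 300.0],
    "Market Category": ["Luxury", None, None, "Performance"],
    "MSRP": [46135, 40650, 29450, 34500],
})

missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "ratio": toy.isnull().mean(),
})
# Keep only the columns that actually have gaps, worst first.
missing = missing[missing["count"] > 0].sort_values("count", ascending=False)

print(missing)
```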

#### ●Check the summary statistics

In [16]:
df_data.describe()

| | Year | Engine HP | Engine Cylinders | Number of Doors | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|
| count | 9532.0 | 9479.0 | 9510.0 | 9526.0 | 9532.0 | 9532.0 | 9532.0 | 9532.0 |
| mean | 2010.399077 | 249.478637 | 5.632387 | 3.435335 | 26.60512 | 19.720835 | 1556.40726 | 40783.78 |
| std | 7.549785 | 109.239858 | 1.786855 | 0.881758 | 8.306401 | 8.906915 | 1443.035732 | 62641.47 |
| min | 1990.0 | 55.0 | 0.0 | 2.0 | 12.0 | 7.0 | 2.0 | 2000.0 |
| 25% | 2007.0 | 170.0 | 4.0 | 2.0 | 22.0 | 16.0 | 549.0 | 21143.75 |
| 50% | 2015.0 | 227.0 | 6.0 | 4.0 | 26.0 | 18.0 | 1385.0 | 29995.0 |
| 75% | 2016.0 | 300.0 | 6.0 | 4.0 | 30.0 | 22.0 | 2009.0 | 42220.0 |
| max | 2017.0 | 1001.0 | 16.0 | 4.0 | 111.0 | 137.0 | 5657.0 | 2065902.0 |


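#### ●Create features from Year
- As one example of feature engineering, derive a new feature from Year: the number of years elapsed since the vehicle went on sale.

In [24]:
# Using 2019 as the reference year, create the number of years since the vehicle went on sale as a new feature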
X['Duration Since Production'] = 2019 - X['Year']
X_score['Duration Since Production'] = 2019 - X_score['Year']
display(X.head(5))

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,Duration Since Production,Make_Model
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,8,BMW_1 Series M
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,8,BMW_1 Series
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,8,BMW_1 Series
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,8,BMW_1 Series
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,8,BMW_1 Series


In [25]:
# Combine Make and Model into a single categorical feature
X['Make_Model'] = X['Make'] + '_' + X['Model']
X_score['Make_Model'] = X_score['Make'] + '_' + X_score['Model']
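
Because the same two derived columns are added to both `X` and `X_score`, the logic can be kept in one place. The sketch below wraps it in a helper; the function name `add_features` and the toy rows are assumptions made for illustration, while the reference year 2019 is the value used in the code above.

```python
import pandas as pd

REFERENCE_YEAR = 2019  # the value used in the code above


def add_features(frame: pd.DataFrame, reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    """Return a copy of frame with the two derived columns added."""
    out = frame.copy()
    # Years elapsed between the model year and the reference year.
    out["Duration Since Production"] = reference_year - out["Year"]
    # Make and Model joined into one categorical key.
    out["Make_Model"] = out["Make"] + "_" + out["Model"]
    return out


# Hypothetical toy rows.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW"],
    "Model": ["1 Series M", "1 Series"],
    "Year": [2011, 1992],
})
toy_features = add_features(toy)

print(toy_features)
```

Calling the helper once for the training frame and once for the score frame keeps the two in step if the feature definitions change later.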

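#### ●Impute missing values

In [7]:
# Fill missing values in Engine HP, Engine Cylinders, and Number of Doors with the median
# (left disabled: X is used downstream without imputation)
#X_complement = X.fillna(X.median(numeric_only=True))

# Confirm that the missing values were filled
#display(X_complement.isnull().sum())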
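
If the imputation step were enabled, one reasonable approach is to compute the medians on the training rows only and reuse them for the score rows, so both sets are filled with the same values. This is a sketch under that assumption, using hypothetical toy frames rather than the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X and X_score.
train = pd.DataFrame({
    "Engine HP": [100.0, np.nan, 300.0, 200.0],
    "Number of Doors": [2.0, 4.0, np.nan, 4.0],
    "Make": ["A", "B", "C", "D"],
})
score = pd.DataFrame({
    "Engine HP": [np.nan, 150.0],
    "Number of Doors": [4.0, np.nan],
    "Make": ["A", "B"],
})

# Medians come from the training rows only, restricted to numeric columns.
medians = train.median(numeric_only=True)

# fillna with a Series fills each column with the value under the same name.
train_filled = train.fillna(medians)
score_filled = score.fillna(medians)

print(medians.to_dict())
print(score_filled)
```

Non-numeric columns such as `Market Category` are untouched by this approach and would need a separate rule (for example a placeholder category).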

#### ●Select the features to use

In [34]:
# Select the features (drop Year)
X_choice = X.copy()
X_choice.drop('Year',axis=1,inplace=True)
X_score_choice = X_score.copy()
X_score_choice.drop('Year',axis=1,inplace=True)

In [42]:
# Concatenate the training and score data before creating dummy variables (kubun: 1 = train, 0 = score)
X_choice['kubun'] = 1
X_score_choice['kubun'] = 0
dataset = pd.concat(objs=[X_choice, X_score_choice], axis=0)


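In [44]:
# One-hot encode every categorical column of the combined data
dataset = pd.get_dummies(dataset)

In [45]:
# Split back into training rows and score rows, and drop the marker column
X_choice = dataset[dataset['kubun'] == 1].drop('kubun',axis=1)
X_score_choice = dataset[dataset['kubun'] == 0].drop('kubun',axis=1)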
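
The reason for concatenating before `pd.get_dummies` is that encoding the two frames separately produces one column per category seen in each frame, so the column sets can differ. A small sketch with hypothetical toy frames shows the difference:

```python
import pandas as pd

# Hypothetical toy frames; "Volvo" appears only in the score rows.
train = pd.DataFrame({"Make": ["BMW", "Audi"], "city mpg": [19, 22]})
score = pd.DataFrame({"Make": ["BMW", "Volvo"], "city mpg": [18, 25]})

# Encoding separately yields different column sets.
separate_train = pd.get_dummies(train)
separate_score = pd.get_dummies(score)

# Encoding after concatenation yields one shared column set.
combined = pd.concat([train.assign(kubun=1), score.assign(kubun=0)], axis=0)
encoded = pd.get_dummies(combined)
train_enc = encoded[encoded["kubun"] == 1].drop("kubun", axis=1)
score_enc = encoded[encoded["kubun"] == 0].drop("kubun", axis=1)

print(list(separate_train.columns))
print(list(separate_score.columns))
print(list(train_enc.columns))
```

A category that occurs only in the score rows becomes an all-zero column in the training rows, which adds width without adding information for training.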

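## 3. Building and evaluating the learner
- Split into training and validation data by hold-out
- Build a prediction model (LightGBM; the original linear regression is commented out)
- Compute predictions with the trained model
- Evaluate the model with RMSE

#### ●Split into training and validation data by hold-out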

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X_choice,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=616)

# Confirm that the training/validation split completed correctly
display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(7625, 2007)

(1907, 2007)

(7625,)

(1907,)
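
A single hold-out split gives one validation estimate that depends on which rows land in the validation set. `KFold` is imported in section 1 but not used in this notebook; the sketch below shows how it could be used to average RMSE over several splits. The synthetic data and the `LinearRegression` stand-in are assumptions made to keep the example small, not the notebook's data or model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data, purely illustrative.
rng = np.random.default_rng(616)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=616)
fold_rmse = []
for train_idx, valid_idx in kfold.split(X_demo):
    # Fit on four folds, evaluate on the remaining one.
    estimator = LinearRegression()
    estimator.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = estimator.predict(X_demo[valid_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_demo[valid_idx], predictions)))

print(np.mean(fold_rmse), np.std(fold_rmse))
```

The spread across folds gives a rough sense of how much a single hold-out score could vary by chance.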

#### ●Build the prediction model (LightGBM in place of linear regression)

In [75]:
import lightgbm as lgb

In [77]:
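# Linear regression model (replaced by LightGBM below)
#linear_regression = linear_model.LinearRegression()
#linear_regression.fit(X_train, y_train)

# Train a LightGBM regressor. The seed keyword is random_state; the recorded run
# misspelled it as randam_state, so LightGBM kept random_state=None.
model = lgb.LGBMRegressor(n_estimators=200,reg_lambda=0.1,num_leaves=61,random_state=616)
model.fit(X_train, y_train)

# Compute RMSE on the training data and on the validation data for the trained model
display(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
display(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Show the predictions for the validation data
predict_X_test = model.predict(X_test)
display(predict_X_test)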

13915.104380327279

11306.601031424407

array([158252.952998  ,  15510.10855673,  28780.05650776, ...,
        44739.43900171,  29628.59236005,  44553.81502089])
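
The two scores above are root mean squared errors, RMSE = sqrt(mean((y - y_pred)^2)), so they are expressed in the same unit as MSRP. A minimal NumPy sketch of the formula follows; the helper name `rmse` and the predicted values are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical predictions: errors of 0, 100 and -200.
actual = [46135, 40650, 36350]
predicted = [46135, 40550, 36550]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few very expensive vehicles that are predicted badly can dominate the score.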

In [79]:
#Best Score
#model = lgb.LGBMRegressor(n_estimators=200,reg_lambda=0.1,num_leaves=61,randam_state=616)
#13915.104380327279
#11306.601031424407

In [78]:
model

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
       importance_type='split', learning_rate=0.1, max_depth=-1,
       min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
       n_estimators=200, n_jobs=-1, num_leaves=61, objective=None,
       randam_state=616, random_state=None, reg_alpha=0.0, reg_lambda=0.1,
       silent=True, subsample=1.0, subsample_for_bin=200000,
       subsample_freq=0)

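## 4. Predicting the score data and writing the result to CSV
- Load the score data
- Set the features X
- Process the data
- Compute predictions from the score-data features with the learner built in section 3
- Write the predictions to CSV

#### ●Load the score data
- Already done in section 1 (`df_score`).

#### ●Set the features X
- Already done in section 1 (`X_score`).

#### ●Process the data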

In [80]:
# Confirm that the data processing completed correctly
display(X_score_choice.head(5))

Unnamed: 0,Driven_Wheels_all wheel drive,Driven_Wheels_four wheel drive,Driven_Wheels_front wheel drive,Driven_Wheels_rear wheel drive,Duration Since Production,Engine Cylinders,Engine Fuel Type_diesel,Engine Fuel Type_electric,Engine Fuel Type_flex-fuel (premium unleaded recommended/E85),Engine Fuel Type_flex-fuel (premium unleaded required/E85),...,Vehicle Style_Coupe,Vehicle Style_Crew Cab Pickup,Vehicle Style_Extended Cab Pickup,Vehicle Style_Passenger Minivan,Vehicle Style_Passenger Van,Vehicle Style_Regular Cab Pickup,Vehicle Style_Sedan,Vehicle Style_Wagon,city mpg,highway MPG
0,0,0,0,1,7,6.0,0,0,0,0,...,0,0,0,0,0,0,0,0,18,28
1,0,0,0,1,6,6.0,0,0,0,0,...,0,0,0,0,0,0,0,0,18,27
2,0,0,1,0,27,6.0,0,0,0,0,...,0,0,0,0,0,0,1,0,17,24
3,1,0,0,0,27,6.0,0,0,0,0,...,0,0,0,0,0,0,0,1,16,20
4,1,0,0,0,25,6.0,0,0,0,0,...,0,0,0,0,0,0,1,0,16,22


#### ●Compute predicted labels from the score-data features

In [81]:
# Use the trained model to compute predicted labels for the score data
predict_X_score = model.predict(X_score_choice)

# Show the predicted labels for the score data
display(predict_X_score, predict_X_score.shape)

array([36398.16219441, 37357.11303593,  1865.55645336, ...,
       68260.6882027 , 51622.51091433, 51622.23898301])

(2382,)

#### ●Write the predictions to CSV

In [83]:
np.savetxt("predict_X_score_堀部.csv", predict_X_score, delimiter=",", fmt='%.5f')
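
`np.savetxt` with `fmt='%.5f'` writes one value per line, rounded to five decimal places, with no header and no id column. The sketch below checks that behaviour on the first three predictions shown above; the temporary file name `predictions_demo.csv` is a hypothetical stand-in.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# First three predicted values from the output above.
predictions = np.array([36398.16219441, 37357.11303593, 1865.55645336])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictions_demo.csv")
    # One value per line, five decimal places, as in the call above.
    np.savetxt(path, predictions, delimiter=",", fmt="%.5f")
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    # Reading the file back confirms the row count.
    reloaded = pd.read_csv(path, header=None)

print(lines)
print(reloaded.shape)
```

Comparing the number of written rows with the number of score rows (2382 here) is a quick check before handing the file over.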