# Linear Regression

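You have been contracted by Hyundai Heavy Industries to build a predictive model for some of its ships. Hyundai Heavy Industries is one of the world's largest ship manufacturers and builds cruise liners.
You have arrived at its headquarters in Ulsan to help predict accurately how many crew members a ship needs.
The company is currently building new ships and wants you to create a predictive model and use it to estimate the number of crew members those ships will require.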

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
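The data above is stored in a csv file called "cruise_ship_info.csv". Your task is to build a regression model that helps predict how many crew members future ships will need. The client also mentioned that they have found that particular cruise lines differ in their acceptable crew counts, so they consider the cruise line the most important feature to include in the analysis!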

In [909]:
import pandas as pd 
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures,StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
import sklearn
from sklearn import linear_model
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [910]:
df = pd.read_csv("./data/cruise_ship_info_example.csv")
df = df.iloc[: , 1:]

In [911]:
df.head()

Unnamed: 0,Ship_name,Cruise_line,Age,Tonnage,passengers,length,cabins,passenger_density,crew
0,Journey,Azamara,6,30.277,6.94,5.94,3.55,42.64,
1,Quest,Azamara,6,30.277,6.94,5.94,3.55,42.64,3.55
2,Celebration,Carnival,26,47.262,14.86,7.22,7.43,31.8,
3,Conquest,Carnival,11,110.0,29.74,9.53,14.88,36.99,19.1
4,Destiny,Carnival,17,101.353,26.42,8.92,13.21,38.36,10.0


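- Columns: `Ship_name`, `Cruise_line` (the company that operates the ship), `Age` (age of the ship), `Tonnage` (displacement), `passengers` (number of passengers), `cabins` (number of cabins), `passenger_density`, and `crew` (the dependent variable; a float, not an integer).

- This is a regression problem for predicting crew size; including the cruise line in the analysis matters most.

- The `crew` value has been left blank in some rows.

- The data is split 7 (rows with the answer, i.e. the dependent variable) : 3 (rows with x only, no y). Train the model on the 70%, produce predictions `pred` for the 30%, and those predictions are scored.

- Send the predictions to the instructor.

- The metric will be MSE.

- Split the 70% training data again, train on one part, and use the fitted model to predict on the rest.

- Some of the answers have been hidden from the full data set; finding those answers means predicting for the 30% of x values.
How many crew members are needed?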
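
The notes above describe splitting the labeled 70% once more and scoring with MSE. A minimal sketch of that idea follows; the data is a hypothetical stand-in (one feature and a noisy linear target), not the cruise ship file, and the 0.3 validation share is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled rows: one feature, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(100, 1))
y = 0.5 * X[:, 0] + rng.normal(0, 1, size=100)

# Split the labeled data again so that part of it acts as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit on the training part only, then score on the held-out part with MSE.
model = LinearRegression().fit(X_tr, y_tr)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(round(val_mse, 3))
```

The validation MSE gives an estimate of how the model might score on the rows whose `crew` value is hidden.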

In [912]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Ship_name          158 non-null    object 
 1   Cruise_line        158 non-null    object 
 2   Age                158 non-null    int64  
 3   Tonnage            158 non-null    float64
 4   passengers         158 non-null    float64
 5   length             158 non-null    float64
 6   cabins             158 non-null    float64
 7   passenger_density  158 non-null    float64
 8   crew               110 non-null    float64
dtypes: float64(6), int64(1), object(2)
memory usage: 11.2+ KB


In [913]:
df.describe()

Unnamed: 0,Age,Tonnage,passengers,length,cabins,passenger_density,crew
count,158.0,158.0,158.0,158.0,158.0,158.0,110.0
mean,15.689873,71.284671,18.457405,8.130633,8.83,39.900949,7.728909
std,7.615691,37.22954,9.677095,1.793474,4.471417,8.639217,3.563549
min,4.0,2.329,0.66,2.79,0.33,17.7,0.59
25%,10.0,46.013,12.535,7.1,6.1325,34.57,5.2
50%,14.0,71.899,19.5,8.555,9.57,39.085,8.63
75%,20.0,90.7725,24.845,9.51,10.885,44.185,10.0
max,48.0,220.0,54.0,11.82,27.0,71.43,19.1


In [914]:
df.isnull().sum()

Ship_name             0
Cruise_line           0
Age                   0
Tonnage               0
passengers            0
length                0
cabins                0
passenger_density     0
crew                 48
dtype: int64

In [915]:
df.corr(numeric_only = True)

Unnamed: 0,Age,Tonnage,passengers,length,cabins,passenger_density,crew
Age,1.0,-0.606646,-0.515542,-0.532286,-0.510019,-0.27883,-0.5547
Tonnage,-0.606646,1.0,0.945061,0.922368,0.948764,-0.040846,0.933967
passengers,-0.515542,0.945061,1.0,0.883535,0.976341,-0.294867,0.920679
length,-0.532286,0.922368,0.883535,1.0,0.889798,-0.090488,0.918819
cabins,-0.510019,0.948764,0.976341,0.889798,1.0,-0.253181,0.95022
passenger_density,-0.27883,-0.040846,-0.294867,-0.090488,-0.253181,1.0,-0.094822
crew,-0.5547,0.933967,0.920679,0.918819,0.95022,-0.094822,1.0


In [916]:
df.columns

Index(['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers', 'length',
       'cabins', 'passenger_density', 'crew'],
      dtype='object')

In [917]:
df['crew'].unique()

array([  nan,  3.55, 19.1 , 10.  ,  9.2 , 11.6 ,  9.3 , 10.29, 11.5 ,
        8.58,  9.99,  9.09,  0.6 ,  6.7 ,  4.  ,  6.36, 10.68,  3.85,
        6.  , 10.9 ,  7.66,  5.45,  9.21, 12.53,  9.45,  8.  ,  5.3 ,
        4.6 ,  5.88,  5.61,  5.31, 13.13,  7.  ,  9.87,  7.4 ,  2.97,
        4.7 , 11.  ,  6.14, 11.09,  4.38,  6.3 ,  3.8 ,  3.5 ,  8.69,
        5.2 ,  8.5 ,  9.  ,  7.94, 12.2 , 12.  , 12.38, 11.1 ,  3.73,
        6.96,  1.46,  3.24,  2.11, 11.85, 11.76, 13.6 ,  7.6 ,  8.22,
        8.68,  8.08,  6.6 ,  1.6 ,  2.1 ,  2.87,  1.97,  6.8 ,  0.59,
        0.88,  1.8 ])

In [918]:
df.isnull()

Unnamed: 0,Ship_name,Cruise_line,Age,Tonnage,passengers,length,cabins,passenger_density,crew
0,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
153,False,False,False,False,False,False,False,False,False
154,False,False,False,False,False,False,False,False,False
155,False,False,False,False,False,False,False,False,False
156,False,False,False,False,False,False,False,False,False


In [919]:
len(df['crew'])

158

In [920]:
df['crew'].isnull().sum()

48

In [921]:
df = df.dropna(axis=0)

In [922]:
df['crew']

1       3.55
3      19.10
4      10.00
5       9.20
6       9.20
       ...  
153     0.59
154    12.00
155     0.88
156     0.88
157     1.80
Name: crew, Length: 110, dtype: float64

In [923]:
df.isnull().sum()

Ship_name            0
Cruise_line          0
Age                  0
Tonnage              0
passengers           0
length               0
cabins               0
passenger_density    0
crew                 0
dtype: int64
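
`dropna` discards the 48 rows without a `crew` value, yet according to the task notes those are the rows to predict. A minimal sketch, assuming a toy frame built from the first four rows shown by `df.head()`, of keeping the two groups apart instead (the names `labeled` and `unlabeled` are hypothetical):

```python
import numpy as np
import pandas as pd

# Miniature of the data: crew is missing for the rows that must be predicted.
toy = pd.DataFrame({
    "Tonnage": [30.277, 30.277, 47.262, 110.0],
    "cabins": [3.55, 3.55, 7.43, 14.88],
    "crew": [np.nan, 3.55, np.nan, 19.1],
})

labeled = toy[toy["crew"].notna()]    # rows usable for training
unlabeled = toy[toy["crew"].isna()]   # rows whose crew must be predicted

print(len(labeled), len(unlabeled))
```

A model fitted on `labeled` could then produce predictions for `unlabeled.drop(columns="crew")`.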

In [924]:
# df["crew"] = df["crew"].replace("nan", np.nan)

In [925]:
# df = df.dropna(subset = ["crew"], axis = 0) 

In [926]:
df['crew'].unique()

array([ 3.55, 19.1 , 10.  ,  9.2 , 11.6 ,  9.3 , 10.29, 11.5 ,  8.58,
        9.99,  9.09,  0.6 ,  6.7 ,  4.  ,  6.36, 10.68,  3.85,  6.  ,
       10.9 ,  7.66,  5.45,  9.21, 12.53,  9.45,  8.  ,  5.3 ,  4.6 ,
        5.88,  5.61,  5.31, 13.13,  7.  ,  9.87,  7.4 ,  2.97,  4.7 ,
       11.  ,  6.14, 11.09,  4.38,  6.3 ,  3.8 ,  3.5 ,  8.69,  5.2 ,
        8.5 ,  9.  ,  7.94, 12.2 , 12.  , 12.38, 11.1 ,  3.73,  6.96,
        1.46,  3.24,  2.11, 11.85, 11.76, 13.6 ,  7.6 ,  8.22,  8.68,
        8.08,  6.6 ,  1.6 ,  2.1 ,  2.87,  1.97,  6.8 ,  0.59,  0.88,
        1.8 ])

In [927]:
len(df['crew'])

110

In [928]:
ndf = df[['Cruise_line', 'Age', 'Tonnage', 'passengers', 'length',
       'cabins', 'crew']]
ndf.head()

Unnamed: 0,Cruise_line,Age,Tonnage,passengers,length,cabins,crew
1,Azamara,6,30.277,6.94,5.94,3.55,3.55
3,Carnival,11,110.0,29.74,9.53,14.88,19.1
4,Carnival,17,101.353,26.42,8.92,13.21,10.0
5,Carnival,22,70.367,20.52,8.55,10.2,9.2
6,Carnival,15,70.367,20.52,8.55,10.2,9.2


In [929]:
ndf["Cruise_line"].unique()

array(['Azamara', 'Carnival', 'Celebrity', 'Costa', 'Crystal', 'Cunard',
       'Disney', 'Holland_American', 'MSC', 'Norwegian', 'Oceania',
       'Orient', 'P&O', 'Princess', 'Regent_Seven_Seas',
       'Royal_Caribbean', 'Seabourn', 'Silversea', 'Star', 'Windstar'],
      dtype=object)

In [930]:
ohe = OneHotEncoder()

In [931]:
cl_arr = np.array(ndf['Cruise_line'])

In [932]:
len(cl_arr)

110

In [933]:
cl_arr = np.reshape(cl_arr, (-1, 1))

In [934]:
cl_name = ohe.fit_transform(cl_arr)

In [935]:
ohe.get_feature_names_out()

array(['x0_Azamara', 'x0_Carnival', 'x0_Celebrity', 'x0_Costa',
       'x0_Crystal', 'x0_Cunard', 'x0_Disney', 'x0_Holland_American',
       'x0_MSC', 'x0_Norwegian', 'x0_Oceania', 'x0_Orient', 'x0_P&O',
       'x0_Princess', 'x0_Regent_Seven_Seas', 'x0_Royal_Caribbean',
       'x0_Seabourn', 'x0_Silversea', 'x0_Star', 'x0_Windstar'],
      dtype=object)

In [936]:
cl_name

<110x20 sparse matrix of type '<class 'numpy.float64'>'
	with 110 stored elements in Compressed Sparse Row format>

In [937]:
ohe_cl_df = pd.DataFrame(cl_name.toarray(), columns = ohe.get_feature_names_out())

In [938]:
ohe_cl_df

Unnamed: 0,x0_Azamara,x0_Carnival,x0_Celebrity,x0_Costa,x0_Crystal,x0_Cunard,x0_Disney,x0_Holland_American,x0_MSC,x0_Norwegian,x0_Oceania,x0_Orient,x0_P&O,x0_Princess,x0_Regent_Seven_Seas,x0_Royal_Caribbean,x0_Seabourn,x0_Silversea,x0_Star,x0_Windstar
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [939]:
ohe_cl_df.shape

(110, 20)

In [940]:
ohe_cl_df.isnull().sum()

x0_Azamara              0
x0_Carnival             0
x0_Celebrity            0
x0_Costa                0
x0_Crystal              0
x0_Cunard               0
x0_Disney               0
x0_Holland_American     0
x0_MSC                  0
x0_Norwegian            0
x0_Oceania              0
x0_Orient               0
x0_P&O                  0
x0_Princess             0
x0_Regent_Seven_Seas    0
x0_Royal_Caribbean      0
x0_Seabourn             0
x0_Silversea            0
x0_Star                 0
x0_Windstar             0
dtype: int64

In [941]:
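# Caution: ndf keeps its original row labels (1, 3, 4, ...) while ohe_cl_df is
# numbered 0..109, so this concat aligns on mismatched labels and yields NaN rows.
ndf = pd.concat([ndf, ohe_cl_df], axis = 1)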

In [942]:
ndf.head()

Unnamed: 0,Cruise_line,Age,Tonnage,passengers,length,cabins,crew,x0_Azamara,x0_Carnival,x0_Celebrity,...,x0_Oceania,x0_Orient,x0_P&O,x0_Princess,x0_Regent_Seven_Seas,x0_Royal_Caribbean,x0_Seabourn,x0_Silversea,x0_Star,x0_Windstar
1,Azamara,6.0,30.277,6.94,5.94,3.55,3.55,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Carnival,11.0,110.0,29.74,9.53,14.88,19.1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Carnival,17.0,101.353,26.42,8.92,13.21,10.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Carnival,22.0,70.367,20.52,8.55,10.2,9.2,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Carnival,15.0,70.367,20.52,8.55,10.2,9.2,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [943]:
ndf.isnull().sum()

Cruise_line             29
Age                     29
Tonnage                 29
passengers              29
length                  29
cabins                  29
crew                    29
x0_Azamara              29
x0_Carnival             29
x0_Celebrity            29
x0_Costa                29
x0_Crystal              29
x0_Cunard               29
x0_Disney               29
x0_Holland_American     29
x0_MSC                  29
x0_Norwegian            29
x0_Oceania              29
x0_Orient               29
x0_P&O                  29
x0_Princess             29
x0_Regent_Seven_Seas    29
x0_Royal_Caribbean      29
x0_Seabourn             29
x0_Silversea            29
x0_Star                 29
x0_Windstar             29
dtype: int64

In [944]:
# ndf= ndf.dropna(axis = 0 )

In [945]:
# ndf = ndf.drop('Cruise_line', axis = 1)

In [946]:
ndf.head()

Unnamed: 0,Cruise_line,Age,Tonnage,passengers,length,cabins,crew,x0_Azamara,x0_Carnival,x0_Celebrity,...,x0_Oceania,x0_Orient,x0_P&O,x0_Princess,x0_Regent_Seven_Seas,x0_Royal_Caribbean,x0_Seabourn,x0_Silversea,x0_Star,x0_Windstar
1,Azamara,6.0,30.277,6.94,5.94,3.55,3.55,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Carnival,11.0,110.0,29.74,9.53,14.88,19.1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Carnival,17.0,101.353,26.42,8.92,13.21,10.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Carnival,22.0,70.367,20.52,8.55,10.2,9.2,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Carnival,15.0,70.367,20.52,8.55,10.2,9.2,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [947]:
x = ndf.drop(['crew'], axis = 1)
y = ndf['crew']

In [948]:
x.head()

Unnamed: 0,Cruise_line,Age,Tonnage,passengers,length,cabins,x0_Azamara,x0_Carnival,x0_Celebrity,x0_Costa,...,x0_Oceania,x0_Orient,x0_P&O,x0_Princess,x0_Regent_Seven_Seas,x0_Royal_Caribbean,x0_Seabourn,x0_Silversea,x0_Star,x0_Windstar
1,Azamara,6.0,30.277,6.94,5.94,3.55,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Carnival,11.0,110.0,29.74,9.53,14.88,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Carnival,17.0,101.353,26.42,8.92,13.21,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Carnival,22.0,70.367,20.52,8.55,10.2,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Carnival,15.0,70.367,20.52,8.55,10.2,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [949]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state= 10)

In [950]:
x_train.shape, x_test.shape

((111, 26), (28, 26))

In [951]:
x_train.columns

Index(['Cruise_line', 'Age', 'Tonnage', 'passengers', 'length', 'cabins',
       'x0_Azamara', 'x0_Carnival', 'x0_Celebrity', 'x0_Costa', 'x0_Crystal',
       'x0_Cunard', 'x0_Disney', 'x0_Holland_American', 'x0_MSC',
       'x0_Norwegian', 'x0_Oceania', 'x0_Orient', 'x0_P&O', 'x0_Princess',
       'x0_Regent_Seven_Seas', 'x0_Royal_Caribbean', 'x0_Seabourn',
       'x0_Silversea', 'x0_Star', 'x0_Windstar'],
      dtype='object')

In [952]:
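# Note: the first five columns include the string column 'Cruise_line',
# which StandardScaler cannot convert to float.
scale_col = x_train.columns[:5].tolist()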

In [953]:
scale_col

['Cruise_line', 'Age', 'Tonnage', 'passengers', 'length']

In [954]:
ndf_ss = StandardScaler()
scaled_train = ndf_ss.fit_transform(x_train[scale_col])
scaled_test = ndf_ss.transform(x_test[scale_col])

ValueError: could not convert string to float: 'Norwegian'
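
The `ValueError` arises because `scale_col` contains the string column `Cruise_line`. A minimal sketch, on hypothetical frames, of selecting the columns to scale by dtype rather than by position:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test frames with one string column and two numeric ones.
x_tr = pd.DataFrame({"Cruise_line": ["Carnival", "Star", "Costa"],
                     "Age": [11, 17, 22],
                     "Tonnage": [110.0, 101.353, 70.367]})
x_te = pd.DataFrame({"Cruise_line": ["Norwegian"], "Age": [15], "Tonnage": [70.367]})

# Pick numeric columns by dtype so that string columns are never scaled.
num_cols = x_tr.select_dtypes(include="number").columns.tolist()

# Fit the scaler on the training rows only, then apply it to the test rows.
scaler = StandardScaler()
scaled_tr = pd.DataFrame(scaler.fit_transform(x_tr[num_cols]),
                         columns=num_cols, index=x_tr.index)
scaled_te = pd.DataFrame(scaler.transform(x_te[num_cols]),
                         columns=num_cols, index=x_te.index)

print(num_cols)
```

Keeping the original index on the scaled frames makes it safer to join them back to the unscaled columns.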

In [None]:
scaled_test

In [None]:
len(scaled_train), len(scaled_test)

In [None]:
scaled_train = pd.DataFrame(scaled_train, columns = scale_col)
scaled_test = pd.DataFrame(scaled_test, columns = scale_col)

In [None]:
scaled_train.head()

In [None]:
x_train.head()

In [None]:
x_train.iloc[:, :5]

In [None]:
x_train.iloc[:, 5:]

In [None]:
# Reset the index of x_train and x_test so the rows line up with the scaled frames
scaled_train = pd.concat([scaled_train, x_train.iloc[:, 5:].reset_index(drop = True)],
                         axis = 1)
scaled_test = pd.concat([scaled_test, x_test.iloc[:, 5:].reset_index(drop = True)],
                         axis = 1)

In [None]:
x_train

In [955]:
scaled_train.shape, scaled_test.shape

((64, 25), (17, 25))

In [956]:
len(y_train), len(y_test)

(111, 28)

In [957]:
len(scaled_train.columns)

25

In [958]:
y_train

81     10.00
17      9.20
55      5.30
48     12.53
108     5.20
       ...  
11       NaN
92      5.20
23     11.50
52       NaN
14      9.30
Name: crew, Length: 111, dtype: float64

In [959]:
lr = LinearRegression()

In [960]:
lr.fit(scaled_train, y_train)

ValueError: Input y contains NaN.

In [961]:
lr.coef_

AttributeError: 'LinearRegression' object has no attribute 'coef_'

In [962]:
lr.intercept_

AttributeError: 'LinearRegression' object has no attribute 'intercept_'

In [963]:
lr.score(scaled_test, y_test)

AttributeError: 'LinearRegression' object has no attribute 'coef_'

In [None]:
pred = lr.predict(scaled_test)

In [964]:
mse = mean_squared_error(y_test, pred)
mse 

ValueError: Found input variables with inconsistent numbers of samples: [28, 17]
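
The errors above cascade: the fit fails because `y_train` contains NaN, so `coef_` never exists, and the 28 versus 17 mismatch suggests that `scaled_test` appears to be left over from an earlier run. One way to keep the preprocessing and the model in step is a scikit-learn `Pipeline`. The sketch below is an assumption about how that could look, on hypothetical data with the same kinds of columns as `ndf`, not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical column, two numeric columns, a noisy target.
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "Cruise_line": rng.choice(["Carnival", "Costa", "Star"], size=n),
    "Tonnage": rng.uniform(10, 150, size=n),
    "cabins": rng.uniform(1, 20, size=n),
})
toy["crew"] = 0.05 * toy["Tonnage"] + 0.4 * toy["cabins"] + rng.normal(0, 0.5, size=n)

X, y = toy.drop(columns="crew"), toy["crew"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Encode the cruise line and scale the numeric columns in one fitted object.
preprocess = ColumnTransformer([
    ("line", OneHotEncoder(handle_unknown="ignore"), ["Cruise_line"]),
    ("num", StandardScaler(), ["Tonnage", "cabins"]),
])
model = Pipeline([("prep", preprocess), ("lr", LinearRegression())])

model.fit(X_tr, y_tr)
pred = model.predict(X_te)
mse = mean_squared_error(y_te, pred)
print(len(pred) == len(y_te), round(mse, 3))
```

Because the same fitted pipeline transforms both splits, the prediction count always matches the number of test rows.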

In [890]:
# # Model optimization
# x = sm.add_constant(scaled_train)

x_train_new = sm.add_constant(x_train)
x_test_new = sm.add_constant(x_test)

# full_mod: the full model with every variable included / OLS: linear regression by ordinary least squares
full_mod = sm.OLS(y_train, x_train_new)

In [891]:
x_train.head()

Unnamed: 0,Age,Tonnage,passengers,length,cabins,x0_Azamara,x0_Carnival,x0_Celebrity,x0_Costa,x0_Crystal,...,x0_Oceania,x0_Orient,x0_P&O,x0_Princess,x0_Regent_Seven_Seas,x0_Royal_Caribbean,x0_Seabourn,x0_Silversea,x0_Star,x0_Windstar
64,13.0,63.0,14.4,7.77,7.2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
47,44.0,70.327,17.91,9.63,9.5,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71,9.0,59.058,17.0,7.63,8.5,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,15.0,70.367,20.52,8.55,10.2,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
89,15.0,30.277,6.84,5.94,3.42,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [892]:
x_train_new.head()

Unnamed: 0,const,Age,Tonnage,passengers,length,cabins,x0_Azamara,x0_Carnival,x0_Celebrity,x0_Costa,...,x0_Oceania,x0_Orient,x0_P&O,x0_Princess,x0_Regent_Seven_Seas,x0_Royal_Caribbean,x0_Seabourn,x0_Silversea,x0_Star,x0_Windstar
64,1.0,13.0,63.0,14.4,7.77,7.2,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
47,1.0,44.0,70.327,17.91,9.63,9.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71,1.0,9.0,59.058,17.0,7.63,8.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,15.0,70.367,20.52,8.55,10.2,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
89,1.0,15.0,30.277,6.84,5.94,3.42,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [893]:
full_res = full_mod.fit()

In [894]:
print(full_res.summary())

                            OLS Regression Results                            
Dep. Variable:                   crew   R-squared:                       0.904
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     18.93
Date:                Thu, 28 Mar 2024   Prob (F-statistic):           3.95e-15
Time:                        12:29:49   Log-Likelihood:                -84.877
No. Observations:                  64   AIC:                             213.8
Df Residuals:                      42   BIC:                             261.2
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.5200 

In [895]:
cnames = x_train.columns

In [896]:
cnames

Index(['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'x0_Azamara',
       'x0_Carnival', 'x0_Celebrity', 'x0_Costa', 'x0_Crystal', 'x0_Cunard',
       'x0_Disney', 'x0_Holland_American', 'x0_MSC', 'x0_Norwegian',
       'x0_Oceania', 'x0_Orient', 'x0_P&O', 'x0_Princess',
       'x0_Regent_Seven_Seas', 'x0_Royal_Caribbean', 'x0_Seabourn',
       'x0_Silversea', 'x0_Star', 'x0_Windstar'],
      dtype='object')

In [897]:
for i in np.arange(len(cnames)):
    xvars = list(cnames)
    yvar = xvars.pop(i)
    mod = sm.OLS(x_train[yvar], sm.add_constant(x_train_new[xvars]))
    res = mod.fit()
    vif = 1/(1 - res.rsquared)
    print(yvar, round(vif, 3))

Age 3.151
Tonnage 48.312
passengers 51.909
length 9.472
cabins 84.904
x0_Azamara nan
x0_Carnival inf
x0_Celebrity inf
x0_Costa inf
x0_Crystal inf
x0_Cunard inf
x0_Disney nan
x0_Holland_American inf
x0_MSC inf
x0_Norwegian inf
x0_Oceania inf
x0_Orient nan
x0_P&O inf
x0_Princess inf
x0_Regent_Seven_Seas inf
x0_Royal_Caribbean inf
x0_Seabourn inf
x0_Silversea inf
x0_Star inf
x0_Windstar inf




#### Variable removal
- x0_Orient  

In [898]:
columns = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'x0_Azamara',
       'x0_Carnival', 'x0_Celebrity', 'x0_Costa', 'x0_Crystal', 'x0_Cunard',
       'x0_Disney', 'x0_Holland_American', 'x0_MSC', 'x0_Norwegian',
       'x0_Oceania', 'x0_Orient', 'x0_P&O', 'x0_Princess',
       'x0_Regent_Seven_Seas', 'x0_Royal_Caribbean', 'x0_Seabourn',
       'x0_Silversea', 'x0_Star', 'x0_Windstar']

pdx = ndf[columns]
pdy = ndf["crew"]

In [899]:
cnames

Index(['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'x0_Azamara',
       'x0_Carnival', 'x0_Celebrity', 'x0_Costa', 'x0_Crystal', 'x0_Cunard',
       'x0_Disney', 'x0_Holland_American', 'x0_MSC', 'x0_Norwegian',
       'x0_Oceania', 'x0_Orient', 'x0_P&O', 'x0_Princess',
       'x0_Regent_Seven_Seas', 'x0_Royal_Caribbean', 'x0_Seabourn',
       'x0_Silversea', 'x0_Star', 'x0_Windstar'],
      dtype='object')

In [900]:
# Drop x0_Orient and refit the linear regression
columns = ['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'x0_Azamara',
       'x0_Carnival', 'x0_Celebrity', 'x0_Costa', 'x0_Crystal', 'x0_Cunard',
       'x0_Disney', 'x0_Holland_American', 'x0_MSC', 'x0_Norwegian',
       'x0_Oceania','x0_P&O', 'x0_Princess',
       'x0_Regent_Seven_Seas', 'x0_Royal_Caribbean', 'x0_Seabourn',
       'x0_Silversea', 'x0_Star', 'x0_Windstar']

pdx = ndf[columns]
pdy = ndf["crew"]
# Split the data
x_train, x_test, y_train, y_test = train_test_split(pdx, pdy, test_size = 0.2,
                                                    random_state = 10)

x_train_new = sm.add_constant(x_train)
x_test_new = sm.add_constant(x_test)

# full_mod: the full model with every variable included / OLS: linear regression by ordinary least squares
full_mod = sm.OLS(y_train, x_train_new)

full_res = full_mod.fit()
print(full_res.summary())

                            OLS Regression Results                            
Dep. Variable:                   crew   R-squared:                       0.904
Model:                            OLS   Adj. R-squared:                  0.857
Method:                 Least Squares   F-statistic:                     18.93
Date:                Thu, 28 Mar 2024   Prob (F-statistic):           3.95e-15
Time:                        12:29:50   Log-Likelihood:                -84.877
No. Observations:                  64   AIC:                             213.8
Df Residuals:                      42   BIC:                             261.2
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.5200 

In [901]:
cnames = x_train.columns
for i in np.arange(len(cnames)):
    xvars = list(cnames)
    yvar = xvars.pop(i)
    mod = sm.OLS(x_train[yvar], sm.add_constant(x_train_new[xvars]))
    res = mod.fit()
    vif = 1/(1 - res.rsquared)
    print(yvar, round(vif, 3))

Age 3.151
Tonnage 48.312
passengers 51.909
length 9.472
cabins 84.904
x0_Azamara nan
x0_Carnival inf
x0_Celebrity inf
x0_Costa inf
x0_Crystal inf
x0_Cunard inf
x0_Disney nan
x0_Holland_American inf
x0_MSC inf
x0_Norwegian inf
x0_Oceania inf
x0_P&O inf
x0_Princess inf
x0_Regent_Seven_Seas inf
x0_Royal_Caribbean inf
x0_Seabourn inf
x0_Silversea inf
x0_Star inf
x0_Windstar inf



In [902]:
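# Doing this the long way because scikit-learn is not used here

# full_res.predict: the predict method of the fitted model

# Predict on x_test
y_pred = full_res.predict(x_test_new)

# Put the predictions in a DataFrame
y_pred_df = pd.DataFrame(y_pred)

# Rename the column
y_pred_df.columns = ["y_pred"]

# y_test holds the true values; put those in a DataFrame as well
pred_data = pd.DataFrame((y_pred_df["y_pred"]))

y_test_new = pd.DataFrame(y_test)
y_test_new.reset_index(inplace=True)

# Caution: pred_data keeps the original row labels while y_test_new was reset to
# 0..n-1, so this assignment aligns on labels and leaves NaN where they differ.
pred_data["y_test"] = pd.DataFrame(y_test_new["crew"])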

In [903]:
# Compute the coefficient of determination (R^2)
rsqd = r2_score(y_test_new["crew"].tolist(), y_pred_df["y_pred"].tolist())

# Round to four decimal places
# The first result was 0.343, which suggested underfitting
print(round(rsqd, 4))

0.9151


In [904]:
# a_rr = y_pred

In [905]:
# a_rr = np.reshape(a_rr, (-1, 1))

In [906]:
# a_rr

In [907]:
pred_data

Unnamed: 0,y_pred,y_test
5,9.709145,4.6
94,8.432915,
55,4.675091,
72,3.151119,
77,10.473284,
57,4.205062,
28,9.042397,
93,8.444751,
62,6.844136,
38,11.165439,
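
The mostly empty `y_test` column in `pred_data` comes from label alignment: the predictions keep the shuffled test index, while the reset targets are numbered from 0. A minimal sketch, with hypothetical values, of pairing the two by position and computing the MSE that the task notes name as the metric:

```python
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical predictions and targets that share a shuffled test index.
y_test = pd.Series([4.6, 9.2, 5.3], index=[5, 94, 55], name="crew")
y_pred = pd.Series([5.0, 9.0, 5.5], index=[5, 94, 55])

# Build the comparison table from raw values so that rows are paired by position.
pred_data = pd.DataFrame(
    {"y_pred": y_pred.to_numpy(), "y_test": y_test.to_numpy()},
    index=y_test.index,
)

mse = mean_squared_error(pred_data["y_test"], pred_data["y_pred"])
r2 = r2_score(pred_data["y_test"], pred_data["y_pred"])
print(int(pred_data.isnull().sum().sum()), round(mse, 4), round(r2, 4))
```

Passing `.to_numpy()` values sidesteps index alignment entirely, which is why no NaN appears in the table.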


In [908]:
y_test.shape

(17,)