### Polynomial Regression Task

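##### Korean Income Prediction
- id: identification number
- year: survey year
- wave: survey wave, from wave 1 (2005) to wave 14 (2018)
- region: 1) Seoul 2) Gyeonggi 3) Gyeongnam 4) Gyeongbuk 5) Chungnam 6) Gangwon & Chungbuk 7) Jeolla & Jeju
- income: annual income in millions of KRW (1,100 KRW = 1 USD)
- family_member: number of family members
- gender: 1) male 2) female
- year_born: year of birth
- education_level: 1) no education (under 7 years old) 2) no education (7 or older) 3) elementary school 4) middle school 5) high school 6) college degree 8) master's degree 9) doctoral degree
- marriage: marital status. 1) not applicable (under 18) 2) married 3) separated by death 4) separated 5) never married 6) other
- religion: 1) has a religion 2) no religion
- occupation: occupation code, provided separately
- company_size: company size
- reason_none_worker: 1) no ability 2) in military service 3) studying at school 4) preparing for school 5) employee 7) caring for children at home 8) nursing 9) gave up economic activity 10) no intention to work 11) other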
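
The coded columns in the list above are easier to inspect once the integer codes are mapped back to labels. The following is a minimal sketch, assuming a small made-up sample rather than the real file; `REGION_LABELS`, `GENDER_LABELS`, and `sample_df` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Label tables copied from the data dictionary (illustrative names).
REGION_LABELS = {
    1: 'Seoul', 2: 'Gyeonggi', 3: 'Gyeongnam', 4: 'Gyeongbuk',
    5: 'Chungnam', 6: 'Gangwon & Chungbuk', 7: 'Jeolla & Jeju',
}
GENDER_LABELS = {1: 'male', 2: 'female'}

# Hypothetical three-row sample, not taken from the real dataset.
sample_df = pd.DataFrame({'region': [1, 5, 7], 'gender': [2, 1, 1]})

# Series.map looks up each code in the dictionary; unknown codes become NaN.
decoded_df = sample_df.assign(
    region=sample_df['region'].map(REGION_LABELS),
    gender=sample_df['gender'].map(GENDER_LABELS),
)
print(decoded_df)
```

Keeping the original integer columns for modeling and decoding only for display avoids changing what the regression cells receive.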

In [1]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_squared_log_error

def get_evaluation(y_test, prediction):
    # Print common regression metrics for the given targets and predictions.
    MAE = mean_absolute_error(y_test, prediction)
    MSE = mean_squared_error(y_test, prediction)
    RMSE = np.sqrt(MSE)
    MSLE = mean_squared_log_error(y_test, prediction)
    # Reuse MSLE instead of computing it a second time.
    RMSLE = np.sqrt(MSLE)
    R2 = r2_score(y_test, prediction)

    print('MAE: {:.4f}, MSE: {:.2f}, RMSE: {:.2f}, MSLE: {:.2f}, RMSLE: {:.2f}, R2: {:.2f}'.format(MAE, MSE, RMSE, MSLE, RMSLE, R2))
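
The relationship between RMSE and RMSLE matters when reading this function's output for log-transformed targets. The sketch below uses made-up incomes and predictions (hypothetical values, not model output) to show that RMSLE on the original scale equals RMSE computed after applying `np.log1p` to both arrays.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Hypothetical incomes and predictions on the original scale.
y_true = np.array([614.0, 896.0, 1310.0, 2208.0])
y_pred = np.array([700.0, 850.0, 1500.0, 2000.0])

# RMSLE computed directly on the original scale.
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# RMSE computed after applying log1p to both arrays.
rmse_of_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

print(rmsle, rmse_of_logs)
```

So when `get_evaluation` receives values that are already log1p-transformed, its RMSE field is the RMSLE of the original values, while its MSLE and RMSLE fields apply the logarithm a second time.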

In [2]:
import pandas as pd

income_df = pd.read_csv('./datasets/korean_income.csv')
income_df

Unnamed: 0,id,year,wave,region,income,family_member,gender,year_born,education_level,marriage,religion,occupation,company_size,reason_none_worker
0,10101,2005,1,1,614.0,1,2,1936,2,2,2,,,8
1,10101,2011,7,1,896.0,1,2,1936,2,2,2,,,10
2,10101,2012,8,1,1310.0,1,2,1936,2,2,2,,,10
3,10101,2013,9,1,2208.0,1,2,1936,2,2,2,,,1
4,10101,2014,10,1,864.0,1,2,1936,2,2,2,,,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92852,98000701,2014,10,5,11600.0,6,1,1967,5,1,1,874,1,
92853,98000701,2015,11,5,8327.0,6,1,1967,5,1,1,874,1,
92854,98000701,2016,12,5,7931.0,6,1,1967,5,1,1,874,1,
92855,98000701,2017,13,5,8802.0,5,1,1967,5,1,1,874,1,


In [3]:
income_df.isna().sum()

id                    0
year                  0
wave                  0
region                0
income                0
family_member         0
gender                0
year_born             0
education_level       0
marriage              0
religion              0
occupation            0
company_size          0
reason_none_worker    0
dtype: int64

In [4]:
income_df.duplicated().sum()

0

In [5]:
income_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92857 entries, 0 to 92856
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  92857 non-null  int64  
 1   year                92857 non-null  int64  
 2   wave                92857 non-null  int64  
 3   region              92857 non-null  int64  
 4   income              92857 non-null  float64
 5   family_member       92857 non-null  int64  
 6   gender              92857 non-null  int64  
 7   year_born           92857 non-null  int64  
 8   education_level     92857 non-null  int64  
 9   marriage            92857 non-null  int64  
 10  religion            92857 non-null  int64  
 11  occupation          92857 non-null  object 
 12  company_size        92857 non-null  object 
 13  reason_none_worker  92857 non-null  object 
dtypes: float64(1), int64(10), object(3)
memory usage: 9.9+ MB


In [6]:
from sklearn.preprocessing import LabelEncoder

columns = ['occupation', 'company_size', 'reason_none_worker']
encoders = []
for column in columns:
    encoder = LabelEncoder()
    category = encoder.fit_transform(income_df[column])
    income_df[column] = category
    encoders.append(encoder)
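
Storing each fitted encoder makes it possible to recover the original category strings later. Here is a minimal sketch on a toy column, assuming string codes with a blank entry; the values and the name `toy_df` are hypothetical and not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical object column; the blank string stands in for an empty entry.
toy_df = pd.DataFrame({'company_size': ['1', ' ', '3', '1']})

encoder = LabelEncoder()
toy_df['company_size'] = encoder.fit_transform(toy_df['company_size'])

# classes_ holds the sorted unique values; each code is an index into it.
print(encoder.classes_)
print(toy_df['company_size'].tolist())

# inverse_transform maps the integer codes back to the original strings.
restored = encoder.inverse_transform(toy_df['company_size'])
print(restored)
```

Note that `LabelEncoder` assigns codes by sorted order, so a linear model will treat them as ordered numbers even when the categories have no natural order.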

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

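# Smallest strictly positive income; used as the floor for negative incomes.
min_value = income_df[income_df['income'] > 0]['income'].min()

print(min_value)

# Replace negative incomes with that floor (2.0) so that log1p stays defined.
income_df.loc[income_df['income'] < 0, 'income'] = min_value

# Caution: iloc[:, :-1] drops only the last column, so `income` (the target) stays
# in the features and `reason_none_worker` is left out.
# income_df.drop(columns='income') would select every column except the target.
features, targets = income_df.iloc[:, :-1], income_df.income

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=0)

# Log-transform the target
y_train_log = np.log1p(y_train)

print(y_train_log.isna().sum())
print((targets <= 0).sum())

linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train_log)

# The model predicts log1p(income), so compare against log1p(y_test).
prediction = linear_regression.predict(X_test)
get_evaluation(np.log1p(y_test), prediction)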


2.0
0
0
MAE: 0.3213, MSE: 0.41, RMSE: 0.64, MSLE: 0.01, RMSLE: 0.09, R2: 0.58
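
The metrics above are on the log scale because the model was trained on `np.log1p(y_train)`. To report errors in the original income units, the predictions can be mapped back with `np.expm1`. The sketch below illustrates this on synthetic data; the features, coefficients, and noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive target that is roughly linear on the log scale.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 2))
y = np.expm1(1.0 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, size=200))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the log1p-transformed target, as in the cell above.
model = LinearRegression().fit(X_train, np.log1p(y_train))
log_prediction = model.predict(X_test)

# expm1 is the inverse of log1p, so this returns predictions in original units.
prediction = np.expm1(log_prediction)

mae_original = mean_absolute_error(y_test, prediction)
print(mae_original)
```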


In [11]:
income_df.isna().sum()

id                    0
year                  0
wave                  0
region                0
income                0
family_member         0
gender                0
year_born             0
education_level       0
marriage              0
religion              0
occupation            0
company_size          0
reason_none_worker    0
dtype: int64

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures


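# iloc[:, :-1] keeps `income` among the features and leaves out `reason_none_worker`.
features, targets = income_df.iloc[:, :-1], income_df.income

# degree=1 only prepends a constant bias column, so this fit matches a plain linear model.
poly_features = PolynomialFeatures(degree=1).fit_transform(features)

X_train, X_test, y_train, y_test = train_test_split(poly_features, targets, test_size=0.2, random_state=0)

# Log-transform the target
y_train = np.log1p(y_train)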

linear_regression = LinearRegression()

linear_regression.fit(X_train, y_train)

prediction = linear_regression.predict(X_test)
get_evaluation(np.log1p(y_test), prediction)

MAE: 0.3213, MSE: 0.41, RMSE: 0.64, MSLE: 0.01, RMSLE: 0.09, R2: 0.58
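
These metrics are identical to the linear baseline because `PolynomialFeatures(degree=1)` adds no higher-order terms. The following sketch, on a made-up two-column array, compares the columns generated for `degree=1` and `degree=2`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical input with two features, x0 and x1.
X = np.array([[2.0, 3.0], [4.0, 5.0]])

# degree=1: a constant bias column plus the original features.
degree_1 = PolynomialFeatures(degree=1).fit_transform(X)

# degree=2: additionally x0^2, x0*x1, and x1^2.
degree_2 = PolynomialFeatures(degree=2).fit_transform(X)

print(degree_1.shape)  # columns: 1, x0, x1
print(degree_2.shape)  # columns: 1, x0, x1, x0^2, x0*x1, x1^2
print(degree_2[0])
```

For the 13 columns selected in the cell above, `degree=2` would expand to 105 columns, so an actual polynomial fit would be noticeably larger than this baseline.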
