### XGBoost Assignment
### Wage prediction

In this assignment, the task is to predict whether a person earns more
than 50K per year, using XGBoost on the classic Adult dataset.

**Attribute Information:**
```
wage_class: >50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, TrinadTobago, Peru, Hong, Holand-Netherlands.
```



**Import Packages and Load Data**

In [1]:
import numpy as np
import pandas as pd
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test' , skiprows = 1, header = None)
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 
              'occupation','relationship', 'race', 'sex', 'capital_gain',
              'capital_loss', 'hours_per_week', 'native_country', 'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels

In [2]:
df_train = train_set.copy()
df_train.shape

(32561, 15)

In [3]:
df_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**EDA**

In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  wage_class      32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [5]:
df_train.shape

(32561, 15)

In [6]:
df_train.workclass.unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [7]:
#The ' ?' value acts as an "unknown" category in certain columns. We keep these rows and encode ' ?' as its own category.
df_train[(df_train.workclass == ' ?') & (df_train.occupation == ' ?')]

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
27,54,?,180211,Some-college,10,Married-civ-spouse,?,Husband,Asian-Pac-Islander,Male,0,0,60,South,>50K
61,32,?,293936,7th-8th,4,Married-spouse-absent,?,Not-in-family,White,Male,0,0,40,?,<=50K
69,25,?,200681,Some-college,10,Never-married,?,Own-child,White,Male,0,0,40,United-States,<=50K
77,67,?,212759,10th,6,Married-civ-spouse,?,Husband,White,Male,0,0,2,United-States,<=50K
106,17,?,304873,10th,6,Never-married,?,Own-child,White,Female,34095,0,32,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32530,35,?,320084,Bachelors,13,Married-civ-spouse,?,Wife,White,Female,0,0,55,United-States,>50K
32531,30,?,33811,Bachelors,13,Never-married,?,Not-in-family,Asian-Pac-Islander,Female,0,0,99,United-States,<=50K
32539,71,?,287372,Doctorate,16,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,>50K
32541,41,?,202822,HS-grad,9,Separated,?,Not-in-family,Black,Female,0,0,32,United-States,<=50K
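To see how widespread the placeholder is, a per-column count can help. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical); it assumes the placeholder keeps its leading space, as in the raw file.

```python
import pandas as pd

# Hypothetical miniature of the raw data; note the leading space in ' ?'.
toy = pd.DataFrame({
    'age': [54, 32, 25],
    'workclass': [' ?', ' ?', ' Private'],
    'occupation': [' ?', ' Sales', ' Sales'],
})

# Element-wise comparison yields a boolean frame; summing counts matches per column.
placeholder_counts = toy.eq(' ?').sum()
print(placeholder_counts)
```

Applied to `df_train`, the same expression would show which columns contain the placeholder at all.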


In [8]:
df_test = test_set.copy()
df_test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.


In [9]:
df_test.shape

(16281, 15)

In [10]:
df_test.wage_class.unique()

array([' <=50K.', ' >50K.'], dtype=object)

In [11]:
#Assign a dummy value (-1) to every wage_class in the test set before concatenation, so that the test rows can be identified after preprocessing
df_test.wage_class = -1

In [12]:
df_test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,-1
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,-1
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,-1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,-1
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,-1


In [13]:
#Combine train and test sets (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df_whole = pd.concat([df_train, df_test])

In [14]:
df_whole.shape

(48842, 15)
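An alternative to a sentinel label is to tag each row with its origin when concatenating, which leaves the real test labels in place. A minimal sketch, using made-up frames `a` and `b` as stand-ins for the train and test sets:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test frames.
a = pd.DataFrame({'age': [39, 50], 'wage_class': [' <=50K', ' >50K']})
b = pd.DataFrame({'age': [25], 'wage_class': [' <=50K.']})

# keys= adds an outer index level naming the origin of each row.
combined = pd.concat([a, b], keys=['train', 'test'])

# Selecting on the outer level recovers each part, labels untouched.
train_part = combined.loc['train']
test_part = combined.loc['test']
```

This is only one option; the notebook's `-1` approach works as well, provided later steps do not treat the dummy label as real data.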

In [15]:
df_whole.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48842 entries, 0 to 16280
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education_num   48842 non-null  int64 
 5   marital_status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital_gain    48842 non-null  int64 
 11  capital_loss    48842 non-null  int64 
 12  hours_per_week  48842 non-null  int64 
 13  native_country  48842 non-null  object
 14  wage_class      48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 6.0+ MB


In [16]:
df_whole.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**Data Preprocessing**

In [17]:
df_whole.workclass.unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [18]:
df_whole.sex.unique()

array([' Male', ' Female'], dtype=object)

In [19]:
df_whole.duplicated().sum()

30

In [20]:
df_whole.drop_duplicates(keep='first', inplace=True)

In [21]:
df_whole.duplicated().sum()

0
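Note that `drop_duplicates` on the combined frame can also remove test rows, because every test row carries the same dummy label. If the test set should keep all of its rows, one option is to de-duplicate only the training rows. A sketch on a hypothetical frame, assuming `-1` marks test rows as above:

```python
import pandas as pd

# Hypothetical combined frame: label -1 marks test rows.
whole = pd.DataFrame({
    'age':        [39, 39, 25, 25],
    'wage_class': [0,  0,  -1, -1],
})

is_test = whole['wage_class'].eq(-1)

# Flag duplicates, but only drop the ones that sit in the training part.
drop_mask = whole.duplicated(keep='first') & ~is_test
deduped = whole[~drop_mask]
```

Here the repeated training row is removed while both test rows survive, so the test part keeps its original length.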

In [22]:
df_whole.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
wage_class        0
dtype: int64

In [23]:
df_whole.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48812 entries, 0 to 16280
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48812 non-null  int64 
 1   workclass       48812 non-null  object
 2   fnlwgt          48812 non-null  int64 
 3   education       48812 non-null  object
 4   education_num   48812 non-null  int64 
 5   marital_status  48812 non-null  object
 6   occupation      48812 non-null  object
 7   relationship    48812 non-null  object
 8   race            48812 non-null  object
 9   sex             48812 non-null  object
 10  capital_gain    48812 non-null  int64 
 11  capital_loss    48812 non-null  int64 
 12  hours_per_week  48812 non-null  int64 
 13  native_country  48812 non-null  object
 14  wage_class      48812 non-null  object
dtypes: int64(6), object(9)
memory usage: 6.0+ MB


In [24]:
df_whole[df_whole.eq(' ?').any(axis=1)]

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,?,>50K
27,54,?,180211,Some-college,10,Married-civ-spouse,?,Husband,Asian-Pac-Islander,Male,0,0,60,South,>50K
38,31,Private,84154,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,38,?,>50K
51,18,Private,226956,HS-grad,9,Never-married,Other-service,Own-child,White,Female,0,0,30,?,<=50K
61,32,?,293936,7th-8th,4,Married-spouse-absent,?,Not-in-family,White,Male,0,0,40,?,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16208,21,?,212661,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,-1
16239,73,?,144872,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,25,Canada,-1
16251,81,?,26711,Assoc-voc,11,Married-civ-spouse,?,Husband,White,Male,2936,0,20,United-States,-1
16265,50,Local-gov,139347,Masters,14,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,40,?,-1


In [25]:
df_whole.race.unique()

array([' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo',
       ' Other'], dtype=object)

In [26]:
df_whole.wage_class.unique()

array([' <=50K', ' >50K', -1], dtype=object)

In [27]:
#map wage_class to 0 and 1: 0 (<=50K), 1 (>50K); the dummy value -1 is kept for test rows
df_whole.wage_class = df_whole.wage_class.map({' <=50K':0, ' >50K':1, -1:-1})
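The raw test labels end with a period (`' <=50K.'`), so a mapping keyed on the training labels alone would not cover them. A small sketch of one way to normalise both variants before mapping; the sample values are copied from the outputs above, and `raw` is a hypothetical name:

```python
import pandas as pd

# Label variants seen in the train and test files.
raw = pd.Series([' <=50K', ' >50K', ' <=50K.', ' >50K.'])

# Strip surrounding whitespace and a trailing period, then map to 0/1.
encoded = raw.str.strip().str.rstrip('.').map({'<=50K': 0, '>50K': 1})
```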

In [28]:
df_whole.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48812 entries, 0 to 16280
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48812 non-null  int64 
 1   workclass       48812 non-null  object
 2   fnlwgt          48812 non-null  int64 
 3   education       48812 non-null  object
 4   education_num   48812 non-null  int64 
 5   marital_status  48812 non-null  object
 6   occupation      48812 non-null  object
 7   relationship    48812 non-null  object
 8   race            48812 non-null  object
 9   sex             48812 non-null  object
 10  capital_gain    48812 non-null  int64 
 11  capital_loss    48812 non-null  int64 
 12  hours_per_week  48812 non-null  int64 
 13  native_country  48812 non-null  object
 14  wage_class      48812 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 6.0+ MB


In [29]:
#remove spaces and lower the characters in categorical columns
for column in df_whole.columns:
    if df_whole[column].dtype == 'object':
        df_whole[column] = df_whole[column].apply(lambda x: x.replace(' ', '').lower())
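The same cleaning can be written with pandas' vectorised string methods instead of `apply` with a lambda. A minimal sketch on a made-up frame; the list `text_columns` is an assumption for this toy example:

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 50],
    'workclass': [' State-gov', ' Self-emp-not-inc'],
    'native_country': [' United-States', ' Cuba'],
})

# Assumed list of text columns for this toy frame.
text_columns = ['workclass', 'native_country']
for column in text_columns:
    # Remove every space (literal match, not a regex) and lower-case the result.
    toy[column] = toy[column].str.replace(' ', '', regex=False).str.lower()
```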

In [30]:
df_whole.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,39,state-gov,77516,bachelors,13,never-married,adm-clerical,not-in-family,white,male,2174,0,40,united-states,0
1,50,self-emp-not-inc,83311,bachelors,13,married-civ-spouse,exec-managerial,husband,white,male,0,0,13,united-states,0
2,38,private,215646,hs-grad,9,divorced,handlers-cleaners,not-in-family,white,male,0,0,40,united-states,0
3,53,private,234721,11th,7,married-civ-spouse,handlers-cleaners,husband,black,male,0,0,40,united-states,0
4,28,private,338409,bachelors,13,married-civ-spouse,prof-specialty,wife,black,female,0,0,40,cuba,0


In [31]:
#Encoding the categorical columns
df_whole = pd.get_dummies(df_whole, columns = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race','sex','native_country'])

In [32]:
df_whole.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,wage_class,workclass_?,workclass_federal-gov,workclass_local-gov,...,native_country_portugal,native_country_puerto-rico,native_country_scotland,native_country_south,native_country_taiwan,native_country_thailand,native_country_trinadad&tobago,native_country_united-states,native_country_vietnam,native_country_yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
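Encoding the combined frame guarantees that train and test end up with the same dummy columns. If the two sets were encoded separately instead, the test columns would need to be aligned to the training layout. A sketch of that alternative with hypothetical data:

```python
import pandas as pd

# Made-up frames: 'peru' appears only in test, 'india' only in train.
train = pd.DataFrame({'age': [39, 50], 'country': ['cuba', 'india']})
test = pd.DataFrame({'age': [25, 44], 'country': ['cuba', 'peru']})

train_enc = pd.get_dummies(train, columns=['country'], dtype=int)
test_enc = pd.get_dummies(test, columns=['country'], dtype=int)

# Align the test columns to the training layout: categories unseen in
# training are dropped, categories missing from test are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```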


In [33]:
#separating cleaned data
test_cleaned = df_whole[df_whole.wage_class == -1]

In [34]:
train_cleaned = df_whole[df_whole.wage_class != -1]

**Building the model**

In [35]:
X = train_cleaned.drop(['wage_class'], axis=1)
y = train_cleaned.wage_class

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [37]:
from xgboost import XGBClassifier
xgb = XGBClassifier(tree_method = 'gpu_hist')
xgb.fit(X_train, y_train)

XGBClassifier(tree_method='gpu_hist')

In [38]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, xgb.predict(X_test))

0.8693915181315304

In [39]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)
X_test_scaled = scale.transform(X_test)

In [40]:
xgb_scaled = XGBClassifier()
xgb_scaled.fit(X_train_scaled, y_train)

XGBClassifier()

In [41]:
xgb_pred = xgb_scaled.predict(X_test_scaled)

In [42]:
accuracy_score(y_test, xgb_pred)

0.870159803318992
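Tree splits compare a feature against a threshold, so they depend on the ordering of values rather than on their scale. Standardisation is therefore not expected to matter much for XGBoost, and the small difference between the two scores above may owe more to the different `tree_method` than to scaling. A toy illustration with NumPy (the values and the threshold are arbitrary):

```python
import numpy as np

hours = np.array([13.0, 30.0, 40.0, 40.0, 50.0, 60.0])
threshold = 35.0  # arbitrary split point chosen for illustration

# Standardise the feature and move the threshold with the same transform.
mean, std = hours.mean(), hours.std()
hours_scaled = (hours - mean) / std
threshold_scaled = (threshold - mean) / std

# The same rows fall on the left side of the split before and after scaling.
left_raw = hours <= threshold
left_scaled = hours_scaled <= threshold_scaled
```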

**Hyperparameter tuning**

In [43]:
from sklearn.model_selection import GridSearchCV

In [45]:
clf_xgb = XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': [10, 100, 1000],
              'learning_rate': [0.001, 0.01, 0.1],
              'subsample': [0.3, 0.6, 0.8],
              'max_depth': [2, 3, 4, 5, 6],
              'tree_method': ['gpu_hist']
            #   'colsample_bytree': [stats.uniform(0.5, 0.4)],
            #   'min_child_weight': [1, 2, 3, 4]
             }


clf = GridSearchCV(clf_xgb,
                   param_dist,
                   cv = 5,
                   scoring = 'accuracy',
                   verbose = 3,
                   n_jobs = -1)

In [46]:
clf.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 135 candidates, totalling 675 fits


GridSearchCV(cv=5, estimator=XGBClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.001, 0.01, 0.1],
                         'max_depth': [2, 3, 4, 5, 6],
                         'n_estimators': [10, 100, 1000],
                         'subsample': [0.3, 0.6, 0.8],
                         'tree_method': ['gpu_hist']},
             scoring='accuracy', verbose=3)
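The figures in the log line (135 candidates, 675 fits) follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A quick check using only the standard library, with the grid values copied from above:

```python
from math import prod

param_grid = {
    'n_estimators': [10, 100, 1000],
    'learning_rate': [0.001, 0.01, 0.1],
    'subsample': [0.3, 0.6, 0.8],
    'max_depth': [2, 3, 4, 5, 6],
    'tree_method': ['gpu_hist'],
}
cv_folds = 5

# 3 * 3 * 3 * 5 * 1 combinations, each fitted once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 135 675
```

Adding another parameter with four values would multiply both numbers by four, which is worth keeping in mind before extending the grid.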

In [47]:
clf.best_score_

0.8714123177904163

In [48]:
xgb_clf = XGBClassifier(**clf.best_params_)
xgb_clf.fit(X_train_scaled, y_train)

XGBClassifier(max_depth=2, n_estimators=1000, subsample=0.8,
              tree_method='gpu_hist')

In [49]:
xgb_pred = xgb_clf.predict(X_test_scaled)

In [50]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, xgb_pred)

0.8752304855562385

In [51]:
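import pickle
# Use a context manager so the file is closed once the model is written
with open('model.pkl', 'wb') as model_file:
    pickle.dump(xgb_clf, model_file)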
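Because the tuned model was trained on scaled features, anyone loading it later also needs the fitted scaler. One option is to pickle both in a single bundle. The sketch below uses plain dictionaries as stand-ins for the fitted objects; the file name and all values are hypothetical:

```python
import os
import pickle
import tempfile

# Plain dictionaries stand in for the fitted scaler and model here.
bundle = {
    'scaler': {'mean': [38.6], 'scale': [13.6]},  # hypothetical values
    'model': {'name': 'xgb_clf placeholder'},
}

# Write the bundle to a temporary location and read it back.
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pkl')
with open(path, 'wb') as handle:
    pickle.dump(bundle, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```

In the notebook, the dictionary values would be `scale` and `xgb_clf` themselves.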

In [52]:
test_cleaned.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,wage_class,workclass_?,workclass_federal-gov,workclass_local-gov,...,native_country_portugal,native_country_puerto-rico,native_country_scotland,native_country_south,native_country_taiwan,native_country_thailand,native_country_trinadad&tobago,native_country_united-states,native_country_vietnam,native_country_yugoslavia
0,25,226802,7,0,0,40,-1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,89814,9,0,0,50,-1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,28,336951,12,0,0,40,-1,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,44,160323,10,7688,0,40,-1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,18,103497,10,0,0,30,-1,1,0,0,...,0,0,0,0,0,0,0,1,0,0


In [53]:
#Reassign rather than drop in place: test_cleaned is a slice of df_whole, and an in-place drop on it triggers a SettingWithCopyWarning
test_cleaned = test_cleaned.drop(['wage_class'], axis=1)


In [54]:
test_cleaned.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_?,workclass_federal-gov,workclass_local-gov,workclass_never-worked,...,native_country_portugal,native_country_puerto-rico,native_country_scotland,native_country_south,native_country_taiwan,native_country_thailand,native_country_trinadad&tobago,native_country_united-states,native_country_vietnam,native_country_yugoslavia
0,25,226802,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,89814,9,0,0,50,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,28,336951,12,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,44,160323,10,7688,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,18,103497,10,0,0,30,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [56]:
test_cleaned_scaled = scale.transform(test_cleaned)

In [57]:
prediction = xgb_clf.predict(test_cleaned_scaled)

In [62]:
test_set.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.


**Comparing the actual test wage_class with predicted**

In [60]:
comparison_df = pd.DataFrame(prediction, columns=['prediction'])

In [65]:
comparison_df['actual'] = test_set['wage_class'].apply(lambda x: 0 if x.strip()=='<=50K.' else 1)

In [66]:
comparison_df['actual'].unique()

array([0, 1])

In [67]:
accuracy_score(comparison_df['actual'], comparison_df['prediction'])

0.6657450076804915
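If rows were dropped from the test portion during preprocessing (for example by `drop_duplicates`), `prediction` is shorter than `test_set`, and assigning `test_set['wage_class']` to a frame with a fresh 0..n-1 index pairs predictions with the wrong rows. A toy sketch of the effect and of an index-based alignment; all data here is made up, and `kept_index` stands in for `test_cleaned.index`:

```python
import pandas as pd

# Hypothetical true labels for five test rows.
actual = pd.Series([0, 1, 1, 0, 1], index=range(5))

# Suppose row 2 was dropped during preprocessing.
kept_index = pd.Index([0, 1, 3, 4])

# Predictions from a perfect model on the rows that survived.
prediction = actual.loc[kept_index].to_numpy()

# Naive comparison: the new frame is indexed 0..3, so labels align by
# position in the original file, not by the rows that were predicted.
naive = pd.DataFrame(prediction, columns=['prediction'])
naive['actual'] = actual
naive_accuracy = (naive['prediction'] == naive['actual']).mean()

# Aligned comparison: look up labels by the surviving index first.
aligned = pd.DataFrame({
    'prediction': prediction,
    'actual': actual.loc[kept_index].to_numpy(),
})
aligned_accuracy = (aligned['prediction'] == aligned['actual']).mean()
```

In this toy case the naive comparison scores a perfect model at 0.5, while the aligned comparison scores it at 1.0.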

In [68]:
comparison_df.to_csv('test_predications.csv', index=False)

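Even with the tuned model, accuracy on the test data (about 0.67) is well below the validation accuracy (about 0.875). One likely contributor is row misalignment rather than the model itself: `drop_duplicates` was applied to the combined frame, so `prediction` may have fewer rows than `test_set`, and the two are then compared position by position. Once that is ruled out, other techniques such as deep learning models or PCA could be tried to improve the score.
<br> This analysis and prediction are for educational purposes only.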