# Log Transform

Linear models and neural networks are both sensitive to the scale and distribution of each feature. For example, if some feature values are enormous compared to the rest, the models can be difficult to train properly. Applying a non-linear transformation (such as a log transform) compresses those large values while preserving the ordering of the data, which can make the patterns in the data easier for the model to learn.
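
To make the effect concrete, here is a minimal sketch on made-up numbers (the `values` list and the `skewness` helper are illustrative assumptions, not part of the hepatitis dataset or this notebook's code). It shows how a log transform pulls in a single extreme reading and reduces the skew of the distribution.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    mean = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - mean) ** 3 for x in xs) / (len(xs) * sd ** 3)

# Made-up positive measurements with one extreme reading (not from the dataset)
values = [0.7, 0.9, 1.0, 1.2, 1.5, 2.3, 8.0]
log_values = [math.log(v) for v in values]  # natural log, defined only for v > 0

print("skewness before log:", round(skewness(values), 2))
print("skewness after log: ", round(skewness(log_values), 2))
```

The ordering of the values is unchanged, but the largest reading sits much closer to the rest after the transform.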

In [None]:
# applying a log transform on the first five columns, which hold the continuous data (training dataset)
X_train_log = np.log(X_train.iloc[:,:5])
X_train_log_copy = X_train.copy()
X_train_log_copy[['AGE','BILIRUBIN','ALK PHOSPHATE','SGOT','ALBUMIN']] = X_train_log

In [None]:
# applying a log transform on the first five columns, which hold the continuous data (test dataset)
X_test_log = np.log(X_test.iloc[:,:5])
X_test_log_copy = X_test.copy()
X_test_log_copy[['AGE','BILIRUBIN','ALK PHOSPHATE','SGOT','ALBUMIN']] = X_test_log
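
One thing to keep in mind is that `np.log` is only well defined for strictly positive inputs: zeros become `-inf` and negative values become `nan`. The cells above assume the continuous columns are all positive. As a hedged sketch, a small guard like the hypothetical `checked_log` helper below (not part of this notebook) could be used to confirm that assumption before transforming; the `sample` array is made up for illustration.

```python
import numpy as np

def checked_log(values):
    """Natural log that refuses non-positive input (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    # NaN entries compare False here, so they pass through and stay NaN
    if np.any(arr <= 0):
        raise ValueError("log transform needs strictly positive values")
    return np.log(arr)

# Made-up positive readings, not taken from the hepatitis data
sample = np.array([[30.0, 1.0, 85.0], [50.0, 0.9, 135.0]])
logged = checked_log(sample)
print(logged)
```

On the real data, the same check could be applied to `X_train.iloc[:,:5]` and `X_test.iloc[:,:5]` before calling `np.log`.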

In [None]:
# Running a cross validation with the linear model on the transformed data
# while preprocessing inside the cross validation loop
log_pipe_cv = Pipeline([('preprocessing',ct),('log',LogisticRegression(random_state=42))])
kfold = KFold(n_splits=3,shuffle=True,random_state=42)
log = LogisticRegression(random_state=42)
# scoring each metric once and reusing the result for the per-fold and average printouts
acc_scores = cross_val_score(log_pipe_cv,X_train_log_copy,y_train,cv=kfold)
auc_scores = cross_val_score(log_pipe_cv,X_train_log_copy,y_train,cv=kfold,scoring='roc_auc')
f1_scores = cross_val_score(log_pipe_cv,X_train_log_copy,y_train,cv=kfold,scoring='f1')
print("Cross-validation scores:\n{}".format(acc_scores))
print('Average score:\n{}'.format(acc_scores.mean()))
print("Cross-validation AUC:\n{}".format(auc_scores))
print('Average AUC:\n{}'.format(auc_scores.mean()))
print('F1 average:\n{}'.format(f1_scores.mean()))

Cross-validation scores:
[0.87179487 0.87179487 0.84615385]
Average score:
0.8632478632478633
Cross-validation AUC:
[0.8125     0.90322581 0.81696429]
Average AUC:
0.8442300307219662
F1 average:
0.594017094017094


In [None]:
# Running a cross validation with the multilayer perceptron on the transformed data
# while preprocessing inside the cross validation loop
mlp_pipe_cv = Pipeline([('preprocessing',ct),('mlp',MLPClassifier(max_iter=2000,random_state=42))])
kfold = KFold(n_splits=3,shuffle=True,random_state=42)
mlp = MLPClassifier(max_iter=2000,random_state=42)
# scoring each metric once and reusing the result for the per-fold and average printouts
acc_scores = cross_val_score(mlp_pipe_cv,X_train_log_copy,y_train,cv=kfold)
auc_scores = cross_val_score(mlp_pipe_cv,X_train_log_copy,y_train,cv=kfold,scoring='roc_auc')
f1_scores = cross_val_score(mlp_pipe_cv,X_train_log_copy,y_train,cv=kfold,scoring='f1')
print("Cross-validation scores:\n{}".format(acc_scores))
print('Average score:\n{}'.format(acc_scores.mean()))
print("Cross-validation AUC:\n{}".format(auc_scores))
print('Average AUC:\n{}'.format(auc_scores.mean()))
print('F1 average:\n{}'.format(f1_scores.mean()))

Cross-validation scores:
[0.8974359  0.87179487 0.87179487]
Average score:
0.8803418803418803
Cross-validation AUC:
[0.75       0.78629032 0.8125    ]
Average AUC:
0.7829301075268816
F1 average:
0.6324786324786325


In [None]:
# Running a cross validation with the support vector machine on the transformed data
# while preprocessing inside the cross validation loop
svc_pipe_cv = Pipeline([('preprocessing',ct),('svm',SVC(random_state=42))])
kfold = KFold(n_splits=3,shuffle=True,random_state=42)
svc = SVC(random_state=42)
# scoring each metric once and reusing the result for the per-fold and average printouts
acc_scores = cross_val_score(svc_pipe_cv,X_train_log_copy,y_train,cv=kfold)
auc_scores = cross_val_score(svc_pipe_cv,X_train_log_copy,y_train,cv=kfold,scoring='roc_auc')
f1_scores = cross_val_score(svc_pipe_cv,X_train_log_copy,y_train,cv=kfold,scoring='f1')
print("Cross-validation scores:\n{}".format(acc_scores))
print('Average score:\n{}'.format(acc_scores.mean()))
print("Cross-validation AUC:\n{}".format(auc_scores))
print('Average AUC:\n{}'.format(auc_scores.mean()))
print('F1 average:\n{}'.format(f1_scores.mean()))

Cross-validation scores:
[0.87179487 0.84615385 0.92307692]
Average score:
0.8803418803418804
Cross-validation AUC:
[0.88839286 0.85080645 0.78571429]
Average AUC:
0.8416378648233486
F1 average:
0.6142191142191142


The log transform did not appear to improve the performance of any of the models overall. One possible explanation is that standardization alone already scaled the data well enough, so the additional transform had little left to contribute.
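
With only three folds, it can help to read the fold-to-fold spread alongside the averages. The sketch below recomputes the mean accuracy for each model from the per-fold scores printed above and adds the sample standard deviation; the `fold_accuracy` and `summary` names are illustrative and not part of the notebook's code.

```python
import statistics

# Per-fold accuracy scores copied from the cross-validation output above
fold_accuracy = {
    "logistic regression": [0.87179487, 0.87179487, 0.84615385],
    "multilayer perceptron": [0.8974359, 0.87179487, 0.87179487],
    "support vector machine": [0.87179487, 0.84615385, 0.92307692],
}

summary = {}
for model, scores in fold_accuracy.items():
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)  # sample standard deviation across the 3 folds
    summary[model] = (mean, spread)
    print(f"{model}: mean={mean:.3f}, std={spread:.3f}")
```

If the spread within a model is comparable to the gaps between models, differences in the averages should be interpreted with caution.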