# Binning and Polynomial Features for Improvement of LogisticRegression

Linear models, as the name implies, model a linear relationship between the features and the target. Binning (discretizing) continuous features is one way to let a linear model capture non-linear effects: each feature is split into several intervals, each interval becomes its own indicator feature, and the model learns a separate coefficient for each of them. The effect of a feature then no longer has to follow a single straight line, at the cost of more parameters to fit.
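To make the effect of binning concrete, here is a minimal sketch on a made-up single-feature array. The names `X_toy` and `kb_toy` are hypothetical and not part of this notebook; `encode='onehot-dense'` is used only so the result prints as a regular array.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data: one feature with ten evenly spaced values
X_toy = np.arange(10, dtype=float).reshape(-1, 1)

# Five quantile bins, one-hot encoded as a dense array for easy printing
kb_toy = KBinsDiscretizer(n_bins=5, strategy='quantile', encode='onehot-dense')
X_toy_binned = kb_toy.fit_transform(X_toy)

print(X_toy.shape)         # one original column
print(X_toy_binned.shape)  # one column per bin
print(X_toy_binned)        # each row has a single 1 marking its bin
```

Each row activates exactly one of the new columns, so a linear model fitted on `X_toy_binned` learns one coefficient per interval instead of a single slope.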

Adding polynomial features (powers of the original features and, when there is more than one feature, their interaction terms) can also help in many cases, because the linear model's representation of a feature changes from a line to a curve. This, too, makes the model more complex.

In [None]:
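from sklearn.preprocessing import KBinsDiscretizer

# applying the column transformer (ct, defined earlier) to scale the training data
X_scaled = ct.fit_transform(X_train)
# creating the KBinsDiscretizer object to split each feature into 5 quantile bins
# (one-hot encoded by default, so each feature becomes up to 5 new columns)
kb = KBinsDiscretizer(n_bins=5, strategy='quantile')
# fitting the KBinsDiscretizer on the scaled training data and transforming it
X_binned = kb.fit_transform(X_scaled)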



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Running cross-validation with the linear model on the binned data
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
log = LogisticRegression(random_state=42)
# computing each metric once and reusing it for the per-fold and average printouts
acc = cross_val_score(log, X_binned, y_train, cv=kfold)
auc = cross_val_score(log, X_binned, y_train, cv=kfold, scoring='roc_auc')
f1 = cross_val_score(log, X_binned, y_train, cv=kfold, scoring='f1')
print("Cross-validation scores:\n{}".format(acc))
print("Average score:\n{}".format(acc.mean()))
print("Cross-validation AUC:\n{}".format(auc))
print("Average AUC:\n{}".format(auc.mean()))
print("F1 average:\n{}".format(f1.mean()))

Cross-validation scores:
[0.82051282 0.92307692 0.79487179]
Average score:
0.8461538461538461
Cross-validation AUC:
[0.875      0.86290323 0.78571429]
Average AUC:
0.841205837173579
F1 average:
0.5204795204795205


In [None]:
from sklearn.preprocessing import PolynomialFeatures

# adding polynomial features (degree 5) of the binned data
poly = PolynomialFeatures(degree=5, include_bias=False)
X_poly = poly.fit_transform(X_binned)

In [None]:
# Running cross-validation with the linear model on the data holding the polynomial features
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
log = LogisticRegression(random_state=42, max_iter=1000)
# computing each metric once and reusing it for the per-fold and average printouts
acc = cross_val_score(log, X_poly, y_train, cv=kfold)
auc = cross_val_score(log, X_poly, y_train, cv=kfold, scoring='roc_auc')
f1 = cross_val_score(log, X_poly, y_train, cv=kfold, scoring='f1')
print("Cross-validation scores:\n{}".format(acc))
print("Average score:\n{}".format(acc.mean()))
print("Cross-validation AUC:\n{}".format(auc))
print("Average AUC:\n{}".format(auc.mean()))
print("F1 average:\n{}".format(f1.mean()))

Cross-validation scores:
[0.84615385 0.82051282 0.84615385]
Average score:
0.8376068376068376
Cross-validation AUC:
[0.79017857 0.84274194 0.81696429]
Average AUC:
0.8166282642089094
F1 average:
0.4444444444444445


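Neither binning nor adding polynomial features (here computed from the binned data) appears to improve the model's performance on any of the key metrics. Compared with binning alone, the polynomial features lower the average accuracy (0.846 to 0.838), the average AUC (0.841 to 0.817), and the average F1 score (0.520 to 0.444). Note that LogisticRegression is still a linear model: it fits a linear function of its inputs and passes the result through a sigmoid, so it is not inherently more flexible than, say, LinearRegression. A more likely explanation is that the relationship between the features and the target is already fairly linear, so the extra features add parameters without adding much signal. The scores also vary considerably from fold to fold (accuracy ranges from 0.79 to 0.92 on the binned data), so small differences between the averages should be read with caution.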
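One caveat about the comparison: in the cells above, the discretizer is fitted on the full training set before cross-validation, so the bin edges are computed with the validation folds included. A possible way to avoid this is to put the preprocessing steps in a pipeline, so they are refitted inside each fold. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo`, and the `StandardScaler` step are stand-in assumptions, not this notebook's dataset or its column transformer, so the printed scores say nothing about the hepatitis data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Hypothetical stand-in data (NOT the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=42)

kfold = KFold(n_splits=3, shuffle=True, random_state=42)
scoring = ['accuracy', 'roc_auc', 'f1']

# Each pipeline is refitted on the training part of every fold,
# so scaling and bin edges never see the validation samples
pipelines = {
    'baseline': make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, random_state=42)),
    'binned': make_pipeline(
        StandardScaler(),
        KBinsDiscretizer(n_bins=5, strategy='quantile'),
        LogisticRegression(max_iter=1000, random_state=42)),
}

results = {}
for name, pipe in pipelines.items():
    # cross_validate evaluates all requested metrics in a single pass
    cv_out = cross_validate(pipe, X_demo, y_demo, cv=kfold, scoring=scoring)
    results[name] = {m: cv_out['test_' + m].mean() for m in scoring}
    print(name, results[name])
```

Evaluating a baseline and the binned variant with the same folds and the same metrics makes the comparison direct, and `cross_validate` avoids running the cross-validation once per metric.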