# 5. Logistic Regression (1)

**RECAP:** From **4. Test/Train Split and Exploratory Data Analysis** we have a training dataset ready for logistic regression.  
  
The current goal is to fit a logistic regression on a subset of the training data in order to identify features that do not contribute to explaining/predicting the target variable (prediabetes status). This information will justify removing those features, making room for features from the *observations* table, which has not yet been included in the analysis.

## Prep

### Import modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
import seaborn as sns

### Load training data

In [None]:
X_train = pd.read_csv ('X_train.csv', index_col = 0)
y_train = pd.read_csv ('y_train.csv', index_col = 0)
y_train = y_train['prediabetes_bin_y'].squeeze()

### Use only 33% of training data

Only 33% of the training data will be used to optimise the hyperparameters for logistic regression. Then, the full training dataset will be used to create another train/test (validation) split in order to assess the performance of the model and obtain coefficient values.

In [None]:
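# Stratified 33% subsample; random_state (arbitrary value) makes the subset reproducible
X_train_33, _, y_train_33, _ = train_test_split (X_train, y_train, test_size=0.67, stratify=y_train, random_state=42)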

del X_train, y_train
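As a quick check of what `stratify` does, the sketch below draws a 33% subsample from a synthetic, imbalanced dataset and compares the positive-class rate before and after. The data, the column names `f1`-`f3`, and the `random_state` value are assumptions for illustration only; they are not the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real training data (hypothetical values)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([1] * 200 + [0] * 800, name="prediabetes_bin_y")

# Keep roughly 33% of the rows; stratify preserves the class ratio,
# random_state makes the draw repeatable
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, test_size=0.67, stratify=y_demo, random_state=0
)

full_rate = y_demo.mean()  # share of positives in the full data
sub_rate = y_sub.mean()    # share of positives in the subsample
print(len(X_sub), round(full_rate, 3), round(sub_rate, 3))
```

If the two rates differ noticeably on the real data, the subsample is probably not stratified as intended.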

## Optimise hyperparameters

A pipeline is constructed to scale the data and test various options for pre-processing and logistic regression parameters such as:
- dimensionality reduction through PCA
- regularisation
    - Ridge
    - Lasso
- regularisation strength (`C`, the inverse of the penalty strength)

In [None]:
# Define pre-processing and hyperparameter options
scaling_options = [MinMaxScaler()] # Scaling
pca_options = ['passthrough', PCA(n_components = 5)] # No PCA, or reduce to 5 components
penalty_options = ['l1', 'l2'] # Lasso or Ridge regression
C_options = [0.01, 0.1, 1, 10] # Inverse regularisation strength (must be > 0)

In [None]:
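# Initialise a pipeline with placeholder steps (the grid search overrides them):
pipe = Pipeline (
    [
        ('scaling', MinMaxScaler()), # must be an instance, not the class
        ('pca', PCA(n_components = 2)),
        # liblinear supports both 'l1' and 'l2' penalties (the default lbfgs does not support 'l1')
        ('logreg', LogisticRegression(solver = 'liblinear', max_iter = 500))
    ]
)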

In [None]:
# Define parameters for grid search
param_grid = [
    {
        'scaling' : scaling_options,
        'pca' : pca_options,
        'logreg__penalty' : penalty_options,
        'logreg__C' : C_options
    }
]
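A grid over a pipeline can replace or skip a whole step as well as tune a step's parameters. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, not the project data): the string `'passthrough'` skips the PCA step, a `PCA` instance replaces it, and `solver='liblinear'` is chosen because it accepts both `l1` and `l2` penalties. The names `demo_pipe`, `demo_grid` and `demo_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic classification problem (stand-in for the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipe = Pipeline([
    ("scaling", MinMaxScaler()),
    ("pca", PCA(n_components=2)),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=500)),
])

demo_grid = {
    "pca": ["passthrough", PCA(n_components=5)],  # skip PCA, or keep 5 components
    "logreg__penalty": ["l1", "l2"],              # Lasso or Ridge
    "logreg__C": [0.1, 1, 10],                    # inverse regularisation strength
}

demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Number of parameter combinations that were evaluated
n_candidates = len(demo_search.cv_results_["params"])
print(n_candidates)
```

Each key is either a step name (to swap the step itself) or `step__parameter` (to tune a parameter of that step).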

A grid search is performed on the 33% subset of the training data defined above, using 10-fold cross-validation.

In [None]:
# Instantiate the grid search with the pipeline and hyperparameters:
grid_search = GridSearchCV (pipe, # pipeline initiated
                            param_grid = param_grid, # grid parameter options
                            cv = 10) # Use cross-validation of 10-fold

# Fit the grid search for best logistic regression parameter on train data
grid_search.fit (X_train_33, y_train_33)

Identify the best parameters for logistic regression based on 33% of the training data.

In [None]:
# Extract the best model:
best_model = grid_search.best_estimator_
# Extract the best hyperparameters:
best_params = grid_search.best_params_
print("Best Hyperparameters:\n", best_params)
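Besides `best_params_`, the fitted search exposes `cv_results_`, which shows how close the other candidates came. The sketch below is a minimal illustration on synthetic data with a grid over `C` only; the dataset and the names `demo_search` and `summary` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the cross-validated score
results = pd.DataFrame(demo_search.cv_results_)
summary = results[["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

When several candidates score within one standard deviation of the best, the simpler or more strongly regularised one may be a reasonable choice.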

In [None]:
# Clean up:
del X_train_33, y_train_33, best_model

## Logistic regression

Logistic regression will be performed on all the training data.

In [None]:
# Load training data again:
X_train = pd.read_csv ('X_train.csv', index_col = 0)
y_train = pd.read_csv ('y_train.csv', index_col = 0)
y_train = y_train['prediabetes_bin_y'].squeeze()

## Train a logistic regression with the best parameters discovered:
# 1. Scale data using MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform (X_train)
# 2. Instantiate Logistic Regression
logreg = LogisticRegression (penalty = 'l2',
                             C = 10,
                             max_iter = 1000) # Increase the maximum number of iterations
# 3. Fit Logistic Regression on the scaled train dataset
logreg.fit (X_train_scaled, y_train)

# Predict prediabetes status for the training data:
lg_best_train = logreg.predict (X_train_scaled)
# Training performance:
report_train = classification_report (y_train, lg_best_train)
print(report_train)

The logistic regression did not perform very well. Despite this, information from this model will be used to further clean and modify the data in order to increase performance.
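The report above is computed on the same rows the model was fitted on, so it may be optimistic. A minimal sketch of the validation split mentioned earlier is shown below on synthetic, imbalanced data. The 25% validation size, the `random_state` values, and the use of `make_pipeline` are assumptions for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hold out a stratified validation set
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Keeping the scaler inside the pipeline means it is fitted on X_fit only,
# so no information from the validation rows leaks into the scaling
model = make_pipeline(MinMaxScaler(), LogisticRegression(C=10, max_iter=1000))
model.fit(X_fit, y_fit)

val_report = classification_report(y_val, model.predict(X_val))
print(val_report)
```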

## Coefficient importance

Coefficients will be extracted and saved.

In [None]:
## Obtain a measure of feature importance:

# Write into a dataframe:
coef_df = pd.DataFrame ({'Features' : ['Intercept'] + list (X_train.columns),
                         'Coefficients' : [logreg.intercept_[0]] +\
                                           list(logreg.coef_[0])
                        })
# Sort in descending order:
coef_df.sort_values (by = ['Coefficients'], ascending = False, inplace = True)
# Print table:
print (coef_df)
#####

## Save in a csv file
coef_df.to_csv ('LogReg_coefficients_careplans_observations.csv')
#####

Next steps will include modifying the data to exclude unnecessary features based on their coefficient magnitude and to introduce new features from the *observations* table.
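Because the table is sorted by signed coefficient, features with near-zero coefficients end up in the middle rather than at one end. The sketch below ranks by absolute value instead; the feature names, coefficient values, and the `threshold` of 0.05 are hypothetical and would need to be chosen for the real data. Magnitudes are only comparable here because all features were scaled to the same range.

```python
import pandas as pd

# Hypothetical coefficient table with the same two columns as coef_df
coef_demo = pd.DataFrame({
    "Features": ["Intercept", "feat_a", "feat_b", "feat_c"],
    "Coefficients": [-1.20, 0.85, -0.02, -1.10],
})

# Drop the intercept, then rank features from weakest to strongest effect
ranked = (
    coef_demo[coef_demo["Features"] != "Intercept"]
    .assign(abs_coef=lambda d: d["Coefficients"].abs())
    .sort_values("abs_coef")
)

# Candidate features for removal: magnitude below an arbitrary cut-off
threshold = 0.05
weak_features = ranked.loc[ranked["abs_coef"] < threshold, "Features"].tolist()
print(weak_features)
```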

---