**The objective of this step is to carry out the modeling and evaluation phase, focusing on Logistic Regression for predicting the categorical targets. The evaluation compares model performance before and after outlier handling, and compares two dimensionality-reduction methods: Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA).**

**Modeling with Logistic Regression:**
Logistic Regression was chosen as the modeling algorithm due to its suitability for predicting categorical targets. It is a widely used algorithm for binary and multiclass classification tasks.

**Training and Testing:**
The dataset was split into training and testing sets to train the model and assess its performance on unseen data.

**Accuracy summary:**

| Dimensionality reduction | Outliers | Target 1 | Target 2 |
|---|---|---|---|
| PCA | with | 0.92 | 0.89 |
| LDA | with | 0.93 | 0.90 |
| PCA | without | 0.92 | 0.89 |
| LDA | without | 0.93 | 0.90 |

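The results show that LDA scores about one percentage point higher than PCA in accuracy on both targets, and the other reported metrics (precision, recall, F1-score) follow the same pattern, so LDA is the reduction I use. Outlier handling makes almost no difference to accuracy. Note that in every classification report shown below, class 1.0 has a recall of 0.00 to 0.03, so the overall accuracy mainly reflects classes 0.0 and 2.0.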

## 1- Import the helpers and read the data (with and without outliers)

In [1]:
import sys

sys.path.append('../../scripts/utilities')
from helper_functions import *

sys.path.append('../../scripts/modeling')
from modeling import *

In [2]:
base_path = '../../data/processed_data/'
df_PCA_with_outlier = read_files('df_filling_missing_values_with_median_encoded_handle_noisy_normalized_highly_correlated_PCA_with_outliers.csv',
                 base_path=base_path)[0]
df_LDA_with_outlier = read_files('df_filling_missing_values_with_median_encoded_handle_noisy_normalized_highly_correlated_LDA_with_outliers.csv',
                 base_path=base_path)[0]

df_PCA_without_outlier = read_files('df_filling_missing_values_with_median_encoded_handle_noisy_handle_outlier_normalized_highly_correlated_perform_PCA.csv',
                 base_path=base_path)[0]

df_LDA_without_outlier = read_files('df_filling_missing_values_with_median_encoded_handle_noisy_handle_outlier_normalized_highly_correlated_perform_LDA.csv',
                 base_path=base_path)[0]

# Train and evaluate PCA with outliers

## 2- Separate the features from the two targets

In [3]:
X, y1, y2 = split_dataset(df_PCA_with_outlier)
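The body of `split_dataset` is not shown in this notebook. A minimal sketch of what such a helper could look like is given below, assuming the two targets live in columns named `target1` and `target2` (hypothetical names; the real column names in the processed CSV files may differ).

```python
import pandas as pd


def split_features_targets(df, target_cols=("target1", "target2")):
    """Return the feature frame followed by one Series per target column.

    The default column names are hypothetical placeholders.
    """
    target_cols = list(target_cols)
    X = df.drop(columns=target_cols)
    return (X, *(df[col] for col in target_cols))


# Toy frame standing in for one of the processed CSV files.
toy = pd.DataFrame({
    "pc1": [0.1, 0.4, 0.9],
    "pc2": [1.2, 0.7, 0.3],
    "target1": [0, 2, 2],
    "target2": [0, 1, 2],
})
X_toy, y1_toy, y2_toy = split_features_targets(toy)
```

Both targets share the same feature matrix, which is why a single `X` is returned alongside `y1` and `y2`.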

## 3- Split the data into training and testing sets

In [4]:
X_train1, X_test1, y_train1, y_test1 = split_train_test(X, y1)
X_train2, X_test2, y_train2, y_test2 = split_train_test(X, y2)
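How `split_train_test` divides the rows is not shown here. Because one class is much rarer than the others in the reports further down, a stratified split is one option worth considering. The sketch below uses scikit-learn's `train_test_split` on synthetic data; the 80/20 ratio, the random seed, and the class proportions are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Synthetic, imbalanced labels: class 1 is rare.
y_demo = np.array([0] * 90 + [1] * 10 + [2] * 100)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

test_counts = np.bincount(y_te)  # number of test rows per class
```

With stratification the rare class is guaranteed a proportional share of the test set, which makes its per-class metrics less dependent on the luck of the shuffle.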

## 4- Standardize the features

In [5]:
X_train_scaled1, X_test_scaled1 = standardize_features(X_train1, X_test1)
X_train_scaled2, X_test_scaled2 = standardize_features(X_train2, X_test2)
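A common way to implement a helper like `standardize_features` is to fit the scaler on the training rows only and reuse those statistics for the test rows, so that no information from the test set leaks into training. The sketch below assumes scikit-learn's `StandardScaler`; whether the notebook's helper works this way is not confirmed by the source.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def standardize(X_train, X_test):
    """Scale both sets using the mean and std of the training set only."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std here
    X_test_scaled = scaler.transform(X_test)        # reuse them here
    return X_train_scaled, X_test_scaled


X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])
train_scaled, test_scaled = standardize(X_train_demo, X_test_demo)
```

After scaling, each training column has zero mean and unit variance, while the test values are expressed in the training set's units.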

## 5- Create and train the logistic regression model

In [6]:
model_target1 = train_logistic_regression(X_train_scaled1, y_train1)
model_target2 = train_logistic_regression(X_train_scaled2, y_train2)
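The hyperparameters used inside `train_logistic_regression` are not shown. As a rough illustration, a thin wrapper around scikit-learn's `LogisticRegression` might look like the sketch below; `max_iter=1000` and the synthetic three-class data are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_logreg(X_train, y_train, max_iter=1000):
    """Fit a multiclass logistic regression with default regularization."""
    model = LogisticRegression(max_iter=max_iter)
    model.fit(X_train, y_train)
    return model


# Three well-separated synthetic clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])
y_toy = np.repeat([0, 1, 2], 30)

clf = fit_logreg(X_toy, y_toy)
toy_accuracy = clf.score(X_toy, y_toy)  # training accuracy on the toy data
```

For imbalanced targets, `LogisticRegression` also accepts `class_weight="balanced"`, which could be tried as a variation if the rare class matters.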

## 6- Make predictions on the test set

In [7]:
y_pred1 = predict(model_target1, X_test_scaled1)
y_pred2 = predict(model_target2, X_test_scaled2)

## 7- Evaluate the models

In [9]:
results1 = evaluate_model(model_target1, y_test1, y_pred1)


results2 = evaluate_model(model_target2, y_test2, y_pred2)
results1,results2

({'accuracy': 0.9253071253071253,
  'classification_report': '              precision    recall  f1-score   support\n\n         0.0       0.96      0.92      0.94       875\n         1.0       1.00      0.00      0.00        46\n         2.0       0.90      0.97      0.93      1114\n\n    accuracy                           0.93      2035\n   macro avg       0.95      0.63      0.62      2035\nweighted avg       0.93      0.93      0.91      2035\n',
  'confusion_matrix': array([[ 805,    0,   70],
         [   0,    0,   46],
         [  36,    0, 1078]], dtype=int64),
  'precision': 0.9253071253071253,
  'recall': 0.9253071253071253,
  'f1_score': 0.9253071253071253},
 {'accuracy': 0.8968058968058968,
  'classification_report': '              precision    recall  f1-score   support\n\n         0.0       0.96      0.92      0.94       875\n         1.0       0.00      0.00      0.00        98\n         2.0       0.86      0.96      0.91      1062\n\n    accuracy                        
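In the first result above, `precision`, `recall`, and `f1_score` are identical to `accuracy`. That pattern is what micro-averaging produces for a single-label multiclass problem, although `evaluate_model` is not shown, so this is an inference rather than a confirmed detail. The sketch below recomputes the micro-averaged values from the target 1 confusion matrix using plain Python.

```python
# Confusion matrix for target 1 (PCA, with outliers), copied from the output.
# Rows are true classes, columns are predicted classes.
cm = [
    [805, 0, 70],
    [0, 0, 46],
    [36, 0, 1078],
]

n = len(cm)
total = sum(sum(row) for row in cm)
tp = [cm[i][i] for i in range(n)]                                    # correct per class
fp = [sum(cm[r][i] for r in range(n)) - cm[i][i] for i in range(n)]  # column minus diagonal
fn = [sum(cm[i]) - cm[i][i] for i in range(n)]                       # row minus diagonal

accuracy = sum(tp) / total
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))
```

Every misclassified row counts once as a false positive and once as a false negative, so the two denominators are equal and both micro scores collapse to the accuracy. Macro-averaged scores, as printed in the classification report, are more informative for imbalanced classes.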

# Train and evaluate LDA with outliers

In [10]:
X, y1, y2 = split_dataset(df_LDA_with_outlier)

In [11]:
X_train1, X_test1, y_train1, y_test1 = split_train_test(X, y1)
X_train2, X_test2, y_train2, y_test2 = split_train_test(X, y2)

In [12]:
X_train_scaled1, X_test_scaled1 = standardize_features(X_train1, X_test1)
X_train_scaled2, X_test_scaled2 = standardize_features(X_train2, X_test2)

In [13]:
model_target1 = train_logistic_regression(X_train_scaled1, y_train1)
model_target2 = train_logistic_regression(X_train_scaled2, y_train2)

In [14]:
y_pred1 = predict(model_target1, X_test_scaled1)
y_pred2 = predict(model_target2, X_test_scaled2)

In [15]:
results1 = evaluate_model(model_target1, y_test1, y_pred1)
results1

{'accuracy': 0.9366093366093367,
 'classification_report': '              precision    recall  f1-score   support\n\n         0.0       0.97      0.93      0.95       875\n         1.0       1.00      0.00      0.00        46\n         2.0       0.91      0.98      0.94      1114\n\n    accuracy                           0.94      2035\n   macro avg       0.96      0.64      0.63      2035\nweighted avg       0.94      0.94      0.93      2035\n',
 'confusion_matrix': array([[ 818,    0,   57],
        [   0,    0,   46],
        [  26,    0, 1088]], dtype=int64),
 'precision': 0.9366093366093367,
 'recall': 0.9366093366093367,
 'f1_score': 0.9366093366093367}

In [16]:
results2 = evaluate_model(model_target2, y_test2, y_pred2)
results2

{'accuracy': 0.9085995085995086,
 'classification_report': '              precision    recall  f1-score   support\n\n         0.0       0.97      0.93      0.95       875\n         1.0       0.38      0.03      0.06        98\n         2.0       0.87      0.97      0.92      1062\n\n    accuracy                           0.91      2035\n   macro avg       0.74      0.64      0.64      2035\nweighted avg       0.89      0.91      0.89      2035\n',
 'confusion_matrix': array([[ 815,    0,   60],
        [   1,    3,   94],
        [  26,    5, 1031]], dtype=int64),
 'precision': 0.9085995085995086,
 'recall': 0.9085995085995086,
 'f1_score': 0.9085995085995086}
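The overall accuracy hides how poorly the rare class is handled. As a small worked example, the per-class recall can be read directly off the target 2 confusion matrix above; the code below is plain Python and only restates numbers already present in the output.

```python
# Confusion matrix for target 2 (LDA, with outliers), copied from the output.
# Rows are true classes 0.0, 1.0, 2.0; columns are predicted classes.
cm = [
    [815, 0, 60],
    [1, 3, 94],
    [26, 5, 1031],
]

# Recall for a class = correctly predicted rows / all rows of that true class.
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
macro_recall = sum(recalls) / len(recalls)

for label, value in zip(("0.0", "1.0", "2.0"), recalls):
    print(f"class {label}: recall = {value:.2f}")
print(f"macro recall = {macro_recall:.2f}")
```

Only 3 of the 98 true class 1.0 rows are recovered, which is why the macro-averaged recall sits far below the accuracy.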

# Train and evaluate PCA without outliers


In [17]:
X, y1, y2 = split_dataset(df_PCA_without_outlier)

In [18]:
X_train1, X_test1, y_train1, y_test1 = split_train_test(X, y1)
X_train2, X_test2, y_train2, y_test2 = split_train_test(X, y2)

In [19]:
X_train_scaled1, X_test_scaled1 = standardize_features(X_train1, X_test1)
X_train_scaled2, X_test_scaled2 = standardize_features(X_train2, X_test2)

In [20]:
model_target1 = train_logistic_regression(X_train_scaled1, y_train1)
model_target2 = train_logistic_regression(X_train_scaled2, y_train2)

In [21]:
y_pred1 = predict(model_target1, X_test_scaled1)
y_pred2 = predict(model_target2, X_test_scaled2)

In [22]:
results = evaluate_model(model_target1, y_test1, y_pred1)
results

{'accuracy': 0.9257985257985258,
 'classification_report': '              precision    recall  f1-score   support\n\n         0.0       0.96      0.92      0.94       875\n         1.0       1.00      0.00      0.00        46\n         2.0       0.90      0.97      0.93      1114\n\n    accuracy                           0.93      2035\n   macro avg       0.95      0.63      0.62      2035\nweighted avg       0.93      0.93      0.92      2035\n',
 'confusion_matrix': array([[ 807,    0,   68],
        [   0,    0,   46],
        [  37,    0, 1077]], dtype=int64),
 'precision': 0.9257985257985258,
 'recall': 0.9257985257985258,
 'f1_score': 0.9257985257985258}

# Train and evaluate LDA without outliers

In [23]:
X, y1, y2 = split_dataset(df_LDA_without_outlier)

In [24]:
X_train1, X_test1, y_train1, y_test1 = split_train_test(X, y1)
X_train2, X_test2, y_train2, y_test2 = split_train_test(X, y2)

In [25]:
X_train_scaled1, X_test_scaled1 = standardize_features(X_train1, X_test1)
X_train_scaled2, X_test_scaled2 = standardize_features(X_train2, X_test2)

In [26]:
model_target1 = train_logistic_regression(X_train_scaled1, y_train1)
model_target2 = train_logistic_regression(X_train_scaled2, y_train2)

In [27]:
y_pred1 = predict(model_target1, X_test_scaled1)
y_pred2 = predict(model_target2, X_test_scaled2)

In [28]:
results = evaluate_model(model_target1, y_test1, y_pred1)
results

{'accuracy': 0.9331695331695332,
 'classification_report': '              precision    recall  f1-score   support\n\n         0.0       0.97      0.93      0.95       875\n         1.0       1.00      0.00      0.00        46\n         2.0       0.91      0.98      0.94      1114\n\n    accuracy                           0.93      2035\n   macro avg       0.96      0.63      0.63      2035\nweighted avg       0.94      0.93      0.92      2035\n',
 'confusion_matrix': array([[ 812,    0,   63],
        [   0,    0,   46],
        [  27,    0, 1087]], dtype=int64),
 'precision': 0.9331695331695332,
 'recall': 0.9331695331695332,
 'f1_score': 0.9331695331695332}