#### Train Logistic Regression Model

Import packages and Load Data

In [1]:
# import packages for training
import pandas as pd
import numpy as np # Good to have for ML tasks
import os  # Used to build the data file paths
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score # Use later for evaluation

In [3]:
# Define the base path components
base_project_folder = r"C:\Users\comat\GitProjects\customer-churn-ai" # Raw string is fine here
data_subfolder = "data"
training_input_subfolder_name = "training_input"

# Construct the path to the 'training_input' directory robustly
training_input_path = os.path.join(base_project_folder, data_subfolder, training_input_subfolder_name)

# Define full file paths with os.path.join() (file names must match the ones used when the data was saved)
x_train_path = os.path.join(training_input_path, "X_train.parquet")
x_test_path = os.path.join(training_input_path, "X_test.parquet")
y_train_path = os.path.join(training_input_path, "y_train.parquet")
y_test_path = os.path.join(training_input_path, "y_test.parquet")

try:
    # Load DataFrames (X_train, X_test)
    X_train = pd.read_parquet(x_train_path)
    X_test = pd.read_parquet(x_test_path)
    print(f"X_train loaded successfully. Shape: {X_train.shape}")
    print(f"X_test loaded successfully. Shape: {X_test.shape}")

    # Load y_train and y_test (they were saved as DataFrames with a 'Churn' column)
    y_train_df = pd.read_parquet(y_train_path)
    y_test_df = pd.read_parquet(y_test_path)

    # Convert y_train and y_test back to Pandas Series for scikit-learn
    if 'Churn' in y_train_df.columns and 'Churn' in y_test_df.columns:
        y_train = y_train_df['Churn']
        y_test = y_test_df['Churn']
        print(f"\ny_train loaded successfully. Shape: {y_train.shape}")
        print(f"y_test loaded successfully. Shape: {y_test.shape}")
        print("\nAll datasets loaded and y_train/y_test converted to Series.")
    else:
        # Without the target column the later cells cannot run, so fail loudly
        raise ValueError("'Churn' column not found in loaded y_train_df or y_test_df.")

except FileNotFoundError:
    print("Error: One or more Parquet files not found. Please check paths:")
    print(f"  X_train expected at: {x_train_path}")
    print(f"  X_test expected at: {x_test_path}")
    print(f"  y_train expected at: {y_train_path}")
    print(f"  y_test expected at: {y_test_path}")
    raise  # Stop here instead of continuing without data
except Exception as e:
    print(f"An error occurred while loading the data: {e}")
    raise  # Re-raise so later cells do not run on missing variables

X_train loaded successfully. Shape: (5634, 32)
X_test loaded successfully. Shape: (1409, 32)

y_train loaded successfully. Shape: (5634,)
y_test loaded successfully. Shape: (1409,)

All datasets loaded and y_train/y_test converted to Series.


In [4]:
# Display the head of X_train and first 5 of y_train to verify
print("\nHead of X_train:")
print(X_train.head())
print("\nHead of y_train:")
print(y_train.head())


Head of X_train:
   SeniorCitizen  MonthlyCharges  TotalCharges    HF_neg    HF_nue    HF_pos  \
0              0       -0.521976     -0.263871 -0.596856 -0.608280  0.631591   
1              0        0.337478     -0.505423 -0.594176  0.155381  0.535430   
2              0       -0.809013     -0.751850 -0.578528  2.236056  0.265643   
3              0        0.284384     -0.174271 -0.599526 -0.571609  0.629585   
4              0       -0.676279     -0.991514 -0.595537 -0.536079  0.621505   

   gender_Male  Partner_Yes  Dependents_Yes  PhoneService_Yes  ...  \
0         True        False           False             False  ...   
1         True         True            True              True  ...   
2         True         True            True             False  ...   
3        False         True           False              True  ...   
4         True         True            True              True  ...   

   PaymentMethod_Credit card (automatic)  PaymentMethod_Electronic check  \
0   

---
Initialize Logistic Regression Model and Train the Model

In [5]:
# 1. Initialize the Logistic Regression Model
# Begin with mostly default parameters. The `liblinear` solver suits binary classification on small datasets.
# `random_state` makes results reproducible when the solver involves randomness.
# `max_iter` is raised from the default of 100 to 1000 to give the solver more room to converge.
log_reg_model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)
print("Logistic Regression model initialized with solver='liblinear' and max_iter=1000.")

Logistic Regression model initialized with solver='liblinear' and max_iter=1000.


In [6]:
# 2. Fit/Train the model using the training data.
print("Training the Logistic Regression model...")
log_reg_model.fit(X_train, y_train)
print("Logistic Regression model trained successfully!")

Training the Logistic Regression model...
Logistic Regression model trained successfully!
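
Whether a fit actually converged within `max_iter` can be checked after training through the fitted model's `n_iter_` attribute. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the churn data, and the names `X_demo`, `y_demo`, `demo_model`, and `iterations_used` are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), not the churn dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Same settings as the notebook's model.
demo_model = LogisticRegression(solver="liblinear", random_state=42, max_iter=1000)
demo_model.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually ran.
iterations_used = int(demo_model.n_iter_.max())
print(f"Iterations used: {iterations_used} of max_iter={demo_model.max_iter}")

if iterations_used >= demo_model.max_iter:
    print("Solver hit the iteration cap; consider raising max_iter.")
```

In the notebook itself, the same check would read `log_reg_model.n_iter_` after the `fit` call.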


In [7]:
# QUICK Check on training accuracy [NOT a substitute for test set evaluation!]
# This just tells us how well the model fits the data it learned from.
y_train_pred_log_reg = log_reg_model.predict(X_train)
train_accuracy_log_reg = accuracy_score(y_train, y_train_pred_log_reg)
print(f"\nQuick check: Training Accuracy for Logistic Regression: {train_accuracy_log_reg:.4f}")


Quick check: Training Accuracy for Logistic Regression: 1.0000


#### Insights / Notes:
* Training accuracy is 1.0: the model makes no mistakes on the data it was trained on. This is not necessarily good, because a perfect training score says nothing about how the model performs on data it has not seen.

* **Potential Implications**:
    * **Overfitting**:
        * *Definition*: a machine learning model learns the training data too specifically, capturing not only the underlying patterns but also the noise and random fluctuations unique to that particular dataset.
        * *Result*: the model becomes very good at predicting data it has already seen but performs poorly when it encounters new, unseen data (here, X_test). This is a problem because generalization is the goal of a predictive model: an overfit model generalizes poorly, and its training score gives a misleading sense of real-world performance.
        * *Analogy*:
            * Overfitting is like a student who has memorized the exact answers to every single question in the practice textbook (X_train, y_train), so they can ace any test that uses those exact questions.
            * However, if the actual exam (X_test, y_test) has slightly different questions or new scenarios (even if based on the same overall subject), the student who only memorized will struggle because they didn't learn the underlying concepts needed to solve new problems.
            * We want a student (model) who learns the concepts well enough to perform well on both practice questions and the real exam.  

* **Plan**:
    * Test log_reg_model on X_test and y_test, which it has not seen before.
    * Train other models, such as Random Forest and XGBoost. They may behave differently on the training data, so I can compare their results and choose the one that generalizes best.
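
The plan above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual evaluation: it runs on synthetic data from `make_classification` rather than the churn data, the helper `train_and_score` and all variable names are hypothetical, and XGBoost is left out because it lives in a separate package. The gap between training and test accuracy is the number to watch for overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_score(models, X_train, y_train, X_test, y_test):
    """Fit each model and return its training accuracy, test accuracy, and their gap."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        results[name] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}
    return results


# Synthetic stand-in data (assumption); swap in the real X_train, X_test, y_train, y_test.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

candidate_models = {
    "logistic_regression": LogisticRegression(solver="liblinear", random_state=42, max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = train_and_score(candidate_models, X_tr, y_tr, X_te, y_te)
for name, s in scores.items():
    print(f"{name}: train={s['train']:.4f}  test={s['test']:.4f}  gap={s['gap']:.4f}")

# Pick the candidate with the highest accuracy on the held-out data.
best_model_name = max(scores, key=lambda name: scores[name]["test"])
print(f"Best test accuracy: {best_model_name}")
```

A large positive gap for a model suggests it fits the training data much better than it generalizes; a small gap with a high test score is what the plan is looking for.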
    
    