# Team:
# Antoine Abou Faycal & Mahdi Alhakim
# EECE 490
# GitHub Link: https://github.com/antoineabf/EECE490_Hackathon
# Predicting Brand Switching Due to Economic Factors Using Logistic Regression

## **Introduction**

The goal of this notebook is to **predict whether smokers have switched to alternative cigarette brands due to economic hardship**. This classification task leverages Logistic Regression to analyze various demographic, economic, lifestyle, and personality features influencing brand switching behavior.

---

## 1. Import Necessary Libraries

Begin by importing all the required libraries for data manipulation, preprocessing, model training, evaluation, and persistence.

In [9]:
# Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib



---

## 2. Load and Inspect the Dataset

Load the dataset from an Excel file into a pandas DataFrame and inspect its structure.



In [10]:
# Load the Dataset
df = pd.read_excel('2024_PersonalityTraits_SurveyData.xls')

# Display the first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Display column names
print("\nColumns in the dataset:")
print(df.columns.tolist())

First 5 rows of the dataset:
   Unnamed: 0   Sector  Last page  \
0           5  Private          5   
1          11  Private          5   
2          14  Private          5   
3          15  Private          5   
4          16  Private          5   

  Have you smoked at least one full tobacco cigarette (excluding e-cigarettes) once or more in the past 30 days?  \
0                                                Yes                                                               
1                                                Yes                                                               
2                                                Yes                                                               
3                                                Yes                                                               
4                                                Yes                                                               

  I see myself as someone who is extraverted, enthu



---

## 3. Define the Target Variable

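Identify and process the target variable, which indicates whether a respondent switched cigarette brands because of the 2019 revolution or economic crisis. Rows with a missing answer are dropped, and the remaining answers are mapped to a binary label.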



In [11]:
# Define the Target Column
target_col = "Has 2019's revolution or economic crisis caused you to switch away from your favorite or preferred cigarette brand(s) to an\xa0 alternative?"

# Drop rows with missing target values
df = df.dropna(subset=[target_col])

# Create Binary Target Variable: 1 for 'Yes', 0 for 'No'
df['brand_switched'] = df[target_col].apply(lambda x: 1 if str(x).strip().lower().startswith('yes') else 0)

# Display Target Variable Distribution
print("Target Variable Distribution:")
print(df['brand_switched'].value_counts())

Target Variable Distribution:
brand_switched
0    134
1     78
Name: count, dtype: int64
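
The mapping rule used above can be looked at in isolation. The sketch below applies the same `strip().lower().startswith('yes')` test to a few hypothetical answer strings (the real survey wording may differ). Note that any answer not beginning with "yes" is mapped to 0, including unexpected or ambiguous responses.

```python
def encode_switch(answer):
    """Return 1 if the answer starts with 'yes' (case-insensitive), else 0."""
    return 1 if str(answer).strip().lower().startswith("yes") else 0

# Hypothetical answer strings, chosen only to illustrate the rule.
samples = ["Yes", "  yes, to a cheaper brand", "No", "Not sure", "YES"]
encoded = [encode_switch(s) for s in samples]
print(encoded)  # [1, 1, 0, 0, 1]
```

With 134 negatives and 78 positives, roughly 37% of the retained respondents are labeled as switchers, so the classes are moderately imbalanced.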




---

## 4. Feature Selection

Select relevant features that may influence the decision to switch cigarette brands.



In [12]:
# Define Feature Columns
feature_cols = [
    'Gender:',
    'How old are you?',
    'Which governerate do you live in or spend most of your time in?',
    'What is the highest level of education you have attained?',
    'What is your current employment status?',
    'What is your main source of income?',
    'If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange).',
    'How would you describe your current income sufficiency?',
    'To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?',
    'How often do you exercise?',
    'How often do you feel stressed?',
    "Are you currently able to afford your favorite or preferred cigarette brand(s)?",
    'Do you find it difficult to refrain from smoking where it is forbidden (church, library, cinema, plane, etc...)?',
    'How would you describe your current smoking behavior compared to your smoking behavior before Lebanon\'s economic crisis and revolution began in 2019?',
    "I see myself as someone who is extraverted, enthusiastic:",
    "I see myself as someone who is critical, quarrelsome:",
    "I see myself as someone who is dependable, self-disciplined:",
    "I see myself as someone who is anxious, easily upset:",
    "I see myself as someone who is open to new experiences:",
    "I see myself as someone who is reserved, quiet:",
    "I see myself as someone who is sympathetic, warm:",
    "I see myself as someone who is disorganized, careless:",
    "I see myself as someone who is calm, emotionally stable:",
    "I see myself as someone who is conventional, uncreative:"
]

# Select Features and Target
X = df[feature_cols]
y = df['brand_switched']

print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")

Number of samples: 212
Number of features: 24




---

## 5. Data Preprocessing

Handle missing values, encode categorical variables, and scale numerical features to prepare the data for modeling.



In [13]:
# Identify Categorical and Numerical Features
# Age is the only numeric feature; every other selected column is a
# categorical survey response (including the personality items).
numeric_features = ['How old are you?']
categorical_features = [col for col in feature_cols if col not in numeric_features]

# Create Preprocessing Pipelines
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numeric_transformer, numeric_features)
])
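
To make the two transformer pipelines more concrete, here is a minimal pure-Python sketch of what they are expected to do to a single column. It is an illustration of the behavior under simple assumptions, not scikit-learn's implementation, and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def impute_and_scale(values):
    """Mean-impute None entries, then standardize to zero mean and unit variance."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)  # population standard deviation
    return [(v - mu) / sigma for v in filled]

def one_hot(value, categories):
    """Encode one value against known categories; unseen values give all zeros."""
    return [1 if value == c else 0 for c in categories]

ages = [20, None, 40]                         # hypothetical ages with one gap
scaled = impute_and_scale(ages)               # the gap is filled with 30 first
print([round(v, 3) for v in scaled])

genders = ["Female", "Male"]                  # hypothetical category list
print(one_hot("Male", genders))               # [0, 1]
print(one_hot("Other", genders))              # [0, 0], mirrors handle_unknown='ignore'
```

In the actual pipeline these statistics and category lists are learned when the pipeline is fitted, so fitting on the training split only keeps test-set information out of the preprocessing.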



---

## 6. Train-Test Split

Divide the dataset into training and testing sets to evaluate the model's performance on unseen data.



In [14]:
# Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

Training samples: 169
Testing samples: 43
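
The printed split sizes can be checked with a little arithmetic. The sketch below assumes the test-set size is rounded up, which is consistent with the 169/43 split shown above, and estimates how many samples of each class a stratified test set should contain.

```python
import math

n_samples, test_size = 212, 0.2
n_test = math.ceil(test_size * n_samples)    # 42.4 rounded up
n_train = n_samples - n_test

# Class counts taken from the target distribution printed in Section 3.
class_counts = {0: 134, 1: 78}
expected_test = {
    label: count * n_test / n_samples for label, count in class_counts.items()
}

print(n_train, n_test)                                     # 169 43
print({k: round(v, 2) for k, v in expected_test.items()})  # {0: 27.18, 1: 15.82}
```

Rounded to whole samples, this gives about 27 non-switchers and 16 switchers in the test set, so both classes keep roughly their original proportions.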




---

## 7. Model Training

Create a Logistic Regression pipeline, train the model, and make predictions on the test set.



In [15]:
# Create Logistic Regression Pipeline
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Train the Model
logreg_pipeline.fit(X_train, y_train)

# Predict on Test Set
y_pred = logreg_pipeline.predict(X_test)
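
For intuition about what the classifier does with the preprocessed features, the following sketch computes a logistic regression prediction by hand. The weights, bias, and feature vectors are hypothetical values chosen for illustration; they are not the coefficients learned by the pipeline above.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one encoded feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return (1 if p > threshold else 0), p

# Hypothetical coefficients for two one-hot columns.
weights, bias = [1.5, -0.8], -0.2

label_a, p_a = predict_label([1, 0], weights, bias)   # score  1.3
label_b, p_b = predict_label([0, 1], weights, bias)   # score -1.0
print(label_a, round(p_a, 2))  # 1 0.79
print(label_b, round(p_b, 2))  # 0 0.27
```

A positive weight on a one-hot column pushes the predicted probability of switching up whenever that answer is selected, and a negative weight pushes it down.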



---

## 8. Evaluate Model Performance

Assess the model's accuracy, precision, recall, and F1-score, and examine the confusion matrix.



In [16]:
# Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}\n")

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.88

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.96      0.91        27
           1       0.92      0.75      0.83        16

    accuracy                           0.88        43
   macro avg       0.89      0.86      0.87        43
weighted avg       0.89      0.88      0.88        43

Confusion Matrix:
[[26  1]
 [ 4 12]]
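
The figures in the classification report can be reproduced directly from the confusion matrix. The sketch below recomputes the class 1 ("switched") metrics by hand and adds a majority-class baseline for comparison; the baseline is an addition for context, not part of the original evaluation.

```python
# Confusion matrix from the output above: rows are true labels, columns predictions.
cm = [[26, 1],
      [4, 12]]
tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of predicted switchers, how many were right
recall = tp / (tp + fn)             # of actual switchers, how many were found
f1 = 2 * precision * recall / (precision + recall)
baseline = max(tn + fp, fn + tp) / total   # always predicting the majority class

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} baseline={baseline:.2f}")
# accuracy=0.88 precision=0.92 recall=0.75 f1=0.83 baseline=0.63
```

The model misses 4 of the 16 actual switchers, which is why recall for class 1 is lower than its precision. With only 43 test samples, each misclassification shifts accuracy by more than two percentage points, so these numbers should be read as rough estimates.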




---

## 9. Save the Trained Model

Persist the trained model to disk for future use without retraining.



In [17]:
# Save the Trained Model
joblib.dump(logreg_pipeline, 'brand_switch_logreg_model.pkl')
print("Trained model saved as 'brand_switch_logreg_model.pkl'.")

Trained model saved as 'brand_switch_logreg_model.pkl'.




---

## 10. Load the Trained Model

Demonstrate how to load the saved model for making future predictions.



In [18]:
# Load the Trained Model
try:
    loaded_model = joblib.load('brand_switch_logreg_model.pkl')
    print("Model loaded successfully.")
except FileNotFoundError:
    print("Error: The model file 'brand_switch_logreg_model.pkl' was not found.")
    # Train and save the model (Section 9) before running this cell.
    # Re-raise instead of calling exit(), which would shut down the notebook kernel.
    raise

Model loaded successfully.




---

## 11. Define and Predict Test Cases

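Create two hand-crafted test cases, one expected to yield a "Yes" prediction and one expected to yield a "No", as a quick sanity check of the model's behavior. Each case must supply a value for every feature column the model was trained on.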



In [19]:
# Define Two Test Cases: One Expected to Predict 'Yes' and Another 'No'

test_cases = {
    "Case 1": {
        'Gender:': 'Male',
        'How old are you?': 28,
        'Which governerate do you live in or spend most of your time in?': 'Beirut',
        'What is the highest level of education you have attained?': "Bachelor's degree (BA/BS)",
        'What is your current employment status?': 'Employee; full-time',
        'What is your main source of income?': 'Job',
        'If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange).': 'Between 4 and 8 million L.L',
        'How would you describe your current income sufficiency?': 'Low: barely covers basic needs',
        'To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?': 'Significantly affected',
        'How often do you exercise?': 'Rarely or never',
        'How often do you feel stressed?': 'Often or at least 3 days every week',
        "Are you currently able to afford your favorite or preferred cigarette brand(s)?": 'No',
        'Do you find it difficult to refrain from smoking where it is forbidden (church, library, cinema, plane, etc...)?': 'Yes',
        'How would you describe your current smoking behavior compared to your smoking behavior before Lebanon\'s economic crisis and revolution began in 2019?': 'I smoke fewer cigarettes per day',
        "I see myself as someone who is extraverted, enthusiastic:": 'Agree a little',
        "I see myself as someone who is critical, quarrelsome:": 'Agree moderately',
        "I see myself as someone who is dependable, self-disciplined:": 'Disagree a little',
        "I see myself as someone who is anxious, easily upset:": 'Agree strongly',
        "I see myself as someone who is open to new experiences:": 'Agree a little',
        "I see myself as someone who is reserved, quiet:": 'Disagree a little',
        "I see myself as someone who is sympathetic, warm:": 'Agree strongly',
        "I see myself as someone who is disorganized, careless:": 'Agree strongly',
        "I see myself as someone who is calm, emotionally stable:": 'Disagree a little',
        "I see myself as someone who is conventional, uncreative:": 'Agree a little'
    },
    "Case 2": {
        'Gender:': 'Female',
        'How old are you?': 35,
        'Which governerate do you live in or spend most of your time in?': 'Mount Lebanon',
        'What is the highest level of education you have attained?': "Master's degree (MA/MS)",
        'What is your current employment status?': 'Self-employed',
        'What is your main source of income?': 'Business',
        'If you receive payment in Lebanese Lira, what is your current estimated monthly household income? (If income is in US Dollars, then refer to the current black market exchange).': 'Above 12 million L.L',
        'How would you describe your current income sufficiency?': 'High: more than enough',
        'To what extent were you financially (negatively) affected by the deterioration of the Lebanese economy?': 'Not affected',
        'How often do you exercise?': 'Regularly or every day',
        'How often do you feel stressed?': 'Rarely or never',
        "Are you currently able to afford your favorite or preferred cigarette brand(s)?": 'Yes',
        'Do you find it difficult to refrain from smoking where it is forbidden (church, library, cinema, plane, etc...)?': 'No',
        'How would you describe your current smoking behavior compared to your smoking behavior before Lebanon\'s economic crisis and revolution began in 2019?': 'I smoke the same number of cigarettes per day',
        "I see myself as someone who is extraverted, enthusiastic:": 'Agree strongly',
        "I see myself as someone who is critical, quarrelsome:": 'Disagree strongly',
        "I see myself as someone who is dependable, self-disciplined:": 'Agree strongly',
        "I see myself as someone who is anxious, easily upset:": 'Agree a little',
        "I see myself as someone who is open to new experiences:": 'Agree strongly',
        "I see myself as someone who is reserved, quiet:": 'Disagree a little',
        "I see myself as someone who is sympathetic, warm:": 'Agree moderately',
        "I see myself as someone who is disorganized, careless:": 'Disagree strongly',
        "I see myself as someone who is calm, emotionally stable:": 'Agree strongly',
        "I see myself as someone who is conventional, uncreative:": 'Disagree a little'
    }
}

# Convert Test Cases to DataFrame
test_df_yes = pd.DataFrame([test_cases["Case 1"]])
test_df_no = pd.DataFrame([test_cases["Case 2"]])
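
Because the encoder is configured with `handle_unknown='ignore'`, a category value that never appeared in the training data is silently encoded as all zeros rather than raising an error. A small check like the one below can flag that situation before predicting. It is a sketch: the helper `check_case`, the shortened column list, and the category values are hypothetical, and with the real pipeline the known categories could be read from the fitted encoder's `categories_` attribute.

```python
def check_case(case, expected_columns, known_categories):
    """Report missing columns, extra columns, and category values not seen in training."""
    missing = [c for c in expected_columns if c not in case]
    extra = [c for c in case if c not in expected_columns]
    unseen = {
        col: case[col]
        for col, cats in known_categories.items()
        if col in case and case[col] not in cats
    }
    return missing, extra, unseen

# Hypothetical subset of columns and category values, for illustration only.
expected_columns = ["Gender:", "How old are you?", "What is your main source of income?"]
known_categories = {
    "Gender:": {"Male", "Female"},
    "What is your main source of income?": {"Salary", "Family support"},
}
case = {
    "Gender:": "Male",
    "How old are you?": 28,
    "What is your main source of income?": "Job",
}

missing, extra, unseen = check_case(case, expected_columns, known_categories)
print(missing, extra, unseen)
# [] [] {'What is your main source of income?': 'Job'}
```

An unseen value does not break prediction, but the corresponding feature then contributes nothing to the score, so the prediction rests on the remaining answers only.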



---

## 12. Make Predictions on Test Cases

Use the loaded model to predict whether the respondents described by the two test cases would be classified as having switched brands due to economic factors.



In [20]:
# Function to Make Predictions
def predict_brand_switch(model, input_df):
    prediction = model.predict(input_df)[0]
    predicted_label = "Yes" if prediction == 1 else "No"
    return predicted_label

# Make Predictions
prediction_yes = predict_brand_switch(loaded_model, test_df_yes)
prediction_no = predict_brand_switch(loaded_model, test_df_no)

# Display Predictions
print("\nTest Case Predictions:")
print(f"Yes Case Prediction (Expected: Yes): {prediction_yes}")
print(f"No Case Prediction (Expected: No): {prediction_no}")


Test Case Predictions:
Yes Case Prediction (Expected: Yes): Yes
No Case Prediction (Expected: No): No




---

## 13. Conclusion

In this notebook, we successfully built and evaluated a Logistic Regression model to predict whether smokers switch to alternative cigarette brands due to economic hardship. The workflow included:

1. **Data Loading and Inspection**: Imported the dataset and examined its structure.
2. **Defining the Target Variable**: Created a binary target variable indicating brand switching.
3. **Feature Selection**: Selected relevant features encompassing demographics, economic factors, lifestyle, and personality traits.
4. **Data Preprocessing**: Handled missing values, encoded categorical variables, and scaled numerical features.
5. **Model Training and Evaluation**: Trained the Logistic Regression model and evaluated it using accuracy, a classification report, and a confusion matrix, reaching 0.88 accuracy on the 43-sample test set.
6. **Model Persistence**: Saved the trained model for future use.
7. **Prediction Demonstration**: Defined two distinct test cases and predicted their outcomes as a sanity check; both predictions matched the expected labels.

