
# Classification basics using florist churn data

In this notebook, we'll illustrate classification using logistic regression with the main goal being to define common classification evaluation measures.

The data were downloaded from https://huggingface.co/ and may be artificial. However, they serve to illustrate the points we want to learn about.

If the data are real, they have likely been *balanced* by undersampling the majority class (customers that did not *churn*). The other alternative is that the florist is really bad at retaining customers. Either way, we're going to ignore this as it is tangential to the point of the exercise.



# Python libraries

As usual, we'll start by importing libraries we're going to make use of.

In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import KFold, GridSearchCV

# Import and examine data

We'll import the data from the course github repository.

In [None]:
file = "https://raw.githubusercontent.com/chansen776/MBA-ML-Course-Materials/main/Data/florist_customer_churn.csv"
data = pd.read_csv(file)

Let's see what we've got in the dataset.

In [None]:
print(data.columns)
data.head()

The outcome variable is `churn`. We see that we have mostly qualitative data, including the column `feedback`, which is a basic example of text data that we'll need to deal with.

Note that `total_charges` = `tenure`*`monthly_charges`.

In [None]:
# Let's calculate the correlation of total_charges with tenure*monthly_charges
print('Correlation of total_charges with tenure*monthly_charges:',
      data['total_charges'].corr(data['tenure']*data['monthly_charges']))

Let's see what our variables (outside of `feedback`) look like.

In [None]:
# histogram of tenure
sns.displot(data=data, x='tenure', bins = 10)
plt.title('Histogram of Tenure')
plt.show()

# histogram of monthly_charges
sns.displot(data=data, x='monthly_charges', bins = 10)
plt.title('Histogram of Monthly Charges')
plt.show()

# histogram of total_charges
sns.displot(data=data, x='total_charges', bins = 10)
plt.title('Histogram of Total Charges')
plt.show()


Let's tabulate the qualitative variables.

In [None]:
data['churn'].value_counts()

As noted above, we certainly hope this is not a representative snapshot of a real firm's customer retention.

In [None]:
print(data['contract'].value_counts())
print(sum((data['contract']).isna()))

We have three types of contracts. We'll include these as dummy variables.

In [None]:
print(data['payment_method'].value_counts())
print(sum((data['payment_method']).isna()))

Four payment methods, which we'll include as dummies.

In [None]:
print(data['topic'].value_counts())
print(sum((data['topic']).isna()))

We have 26 observations that do not have a topic recorded. We'll treat those observations as just belonging to a different "null" topic.

In [None]:
# Replace missing values in the 'topic' column with an empty string
data['topic'] = data['topic'].fillna('')
print(data['topic'].value_counts())
print(sum((data['topic']).isna()))

We also have some empty values in `feedback`. We're not going to tabulate feedback, but we will assign these observations to have a "null" feedback.

In [None]:
# Replace missing values in the 'feedback' column with an empty string
data['feedback'] = data['feedback'].fillna('')
print(sum(data['feedback'].isna()))

# Classifying sentiment

We want to use the customer feedback information, but we certainly can't use the raw text and probably don't want to make dummy variables for each unique value (as there are many different phrases). Instead we are going to use a pre-trained large language model to construct the *sentiment* of each string.

Specifically, we are going to use a [sentiment classification model that was fine tuned using Amazon reviews available on hugging face](https://huggingface.co/AdamCodd/distilbert-base-uncased-finetuned-sentiment-amazon).

This model will return a prediction of "positive" or "negative" for each customer feedback phrase. It will also return a score between 0 and 1 for the predicted label. We will convert this score to the predicted probability that the feedback was positive, which we can think about using as a continuous feature in our model to predict churn.

Using the text data in this way is an example of *feature engineering*. We'll talk more about feature engineering as we go through the course.

In [None]:
from transformers import pipeline

sentiment_analysis = pipeline("sentiment-analysis",
                              model="AdamCodd/distilbert-base-uncased-finetuned-sentiment-amazon")

Let's look at how the imported sentiment model assigns sentiment to some phrases.

In [None]:
# "random" phrases
print(sentiment_analysis("I love this!"))
print(sentiment_analysis("I hate this!"))
print(sentiment_analysis("Things are ok."))
print('\n')

# A phrase from our data
print(data['feedback'][10])
sentiment_analysis(data['feedback'][10])

We're now going to generate sentiment scores for all the phrases in our data.

In [None]:
# Construct sentiment "probabilities" and label
sentprobs = []
sent = []
for i in range(len(data)):
  tmp = sentiment_analysis(data['feedback'][i])
  sent.append(tmp[0]['label'])
  if tmp[0]['label'] == 'negative':
    sentprobs.append(1-tmp[0]['score'])
  else:
    sentprobs.append(tmp[0]['score'])

# Make into data and merge with the original data
sentprobs = pd.DataFrame(sentprobs, columns=['polarity'])
sent = pd.DataFrame(sent, columns=['sentiment'])
data = pd.concat([data, sentprobs, sent], axis=1)
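
The conversion inside the loop can be isolated in a small helper. This is a minimal sketch, assuming each result has the shape `[{'label': ..., 'score': ...}]` with labels `'positive'` and `'negative'`, as the loop above relies on. The name `positive_probability` and the example values are hypothetical and are not part of `transformers`.

```python
def positive_probability(result):
    """Convert one sentiment result into the probability of 'positive'.

    The score is reported for the predicted label, so a 'negative'
    prediction with score s implies P(positive) = 1 - s.
    """
    label = result[0]['label']
    score = result[0]['score']
    return 1 - score if label == 'negative' else score

# Hand-written stand-ins for model output (illustrative values only)
example_negative = [{'label': 'negative', 'score': 0.9}]
example_positive = [{'label': 'positive', 'score': 0.8}]

print(positive_probability(example_negative))
print(positive_probability(example_positive))
```

With this convention, values near 0 indicate strongly negative feedback and values near 1 indicate strongly positive feedback.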


Let's look at the new variables.

In [None]:
# Positive/Negative sentiment
print(data['sentiment'].value_counts())

# Polarity
print(data['polarity'].describe())

# histogram of polarity
sns.displot(data=data, x='polarity')
plt.title('Histogram of Polarity')
plt.show()

Recall that we had some people return no text response. They should not be labeled "positive" or "negative". Let's relabel them as "No Feedback."

In [None]:
print(data[data['feedback'] == '']['sentiment'].value_counts())
print('\n')

data.loc[data['feedback'] == '', 'sentiment'] = 'No Feedback'
print(data['sentiment'].value_counts())

We'd really like to verify that the extracted sentiment makes sense by doing human verification (or, even better, by training our own sentiment model based on human feedback). Here we're just going to do some sanity checking by looking at the polarity assigned to some of our phrases.

In [None]:
# Phrases ranked in the bottom 10 of positivity (ranks 1 through 10)
data[data['polarity'].rank(method = 'min') <= 10][['feedback','polarity']]

In [None]:
# Phrases ranked in the top 10 of positivity
data[data['polarity'].rank(method = 'max') > 990][['feedback','polarity']]

In [None]:
# Word cloud of the positive sentiment feedback
text = ' '.join(data[data['sentiment'] == 'positive']['feedback'].dropna().astype(str).tolist())

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.show()

In [None]:
# Word cloud of the negative sentiment feedback
text = ' '.join(data[data['sentiment'] == 'negative']['feedback'].dropna().astype(str).tolist())

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.show()

# Train/Test Split

We are going to use a simple train/test split to validate our models (rather than bother with cross-validation). We also don't have much data, so we won't keep aside a separate test data set.

In [None]:
# Split the data into training (80%) and validation (20%) sets
train_data, validation_data = train_test_split(data, test_size=0.2, random_state=726)

# Logistic Regression

We are going to look at predicting churn using a baseline classification model: *Logistic Regression*.

Logistic regression builds a model for the probability that the outcome variable equals 1. In our example, this will correspond to the probability that a customer leaves (`churn` = `True`).

Logistic regression specifically builds a linear combination of the feature variables that is evaluated inside the logistic (aka *sigmoid*) function to return a value between 0 and 1 (a probability). That is, we build a rule to predict the probability that the outcome is one, of the form

$$\widehat{\text{Pr}(Y = 1|X)} = \sigma(b_0 + b_1 X_1 + ... + b_p X_p)$$

where $\sigma(\cdot)$ is the sigmoid function

$$\sigma(u) = \frac{1}{1 + \exp(-u)} = \frac{\exp(u)}{1 + \exp(u)}.$$

With the predicted probabilities in hand, we can then *classify* the outcome as either a 1 or 0 depending on whether the predicted probability is greater than 0.5.
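
To make the formula concrete, here is a minimal sketch that evaluates the sigmoid at a linear index and applies the 0.5 classification rule. The coefficients `b0`, `b1`, `b2` and the feature values `x1`, `x2` are made up for illustration. They are not estimates from the florist data.

```python
import math

def sigmoid(u):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

# Hypothetical coefficients and feature values, for illustration only
b0, b1, b2 = -1.0, 0.5, 2.0
x1, x2 = 3.0, 0.25

u = b0 + b1 * x1 + b2 * x2   # linear index: -1 + 1.5 + 0.5 = 1.0
p_hat = sigmoid(u)           # predicted Pr(Y = 1 | X)
y_hat = int(p_hat > 0.5)     # classify as 1 when the probability exceeds 0.5

print(round(p_hat, 4), y_hat)
```

Because the sigmoid equals 0.5 at zero, classifying on a predicted probability above 0.5 is the same as classifying on a positive linear index.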


In [None]:
# Set data up for use with logistic regression

# Select features and target variable for training
features = ['tenure', 'monthly_charges', 'total_charges', 'contract',
            'payment_method', 'topic', 'polarity', 'sentiment']

# Get outcome and feature data in train and validation data including dummies
# for our categorical variables
# Because we are not doing regularization or any kind of variable selection,
# we will drop one of each set of dummies.
X_train = pd.get_dummies(train_data[features], dtype=float, drop_first=True)
y_train = train_data['churn']

X_validation = pd.get_dummies(validation_data[features], dtype=float, drop_first=True)
y_validation = validation_data['churn']

# Align validation data to ensure same columns as training data
X_validation = X_validation.reindex(columns=X_train.columns, fill_value=0)

# Fitting the logistic regression
logistic_model = LogisticRegression(max_iter=1000, random_state=726, penalty = None)
logistic_model.fit(X_train, y_train)

# Estimated model parameters
print(pd.DataFrame(data={'Variable Names': X_train.columns, 'Coefficient': logistic_model.coef_[0]}))
print('\n')
print('Intercept', logistic_model.intercept_[0])

Let's see how we do predicting churn in the validation data.

In [None]:
# Make predictions on the validation set
y_pred_logistic = logistic_model.predict(X_validation)
y_pred_prob_logistic = logistic_model.predict_proba(X_validation)[:, 1]

## Confusion Matrix

A standard way to look at predictive performance for classification is via a *confusion matrix*. The confusion matrix just represents classification accuracy by displaying *True Positives* (TP), *False Positives* (FP), *True Negatives* (TN), and *False Negatives* (FN) in the binary case. (With multiple classes, the confusion matrix shows true classifications into each class and false classifications into each class.)
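
As a minimal sketch of what the four cells count, the snippet below tallies TP, FP, TN, and FN by hand for a small made-up set of labels and predictions. The helper name `confusion_counts` and the toy data are hypothetical and are not taken from the florist data.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion matrix cells for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # churned, predicted churn
    fp = sum(1 for t, p in pairs if not t and p)      # stayed, predicted churn
    tn = sum(1 for t, p in pairs if not t and not p)  # stayed, predicted stay
    fn = sum(1 for t, p in pairs if t and not p)      # churned, predicted stay
    return tp, fp, tn, fn

# Toy labels and predictions, for illustration only
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, True, False, False, False, True, False, True]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print('TP:', tp, 'FP:', fp, 'TN:', tn, 'FN:', fn)
```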

In [None]:
# Display confusion matrix
ConfusionMatrixDisplay.from_predictions(y_validation, y_pred_logistic)
plt.show()

In our toy example, we see that we do a pretty good job predicting our 200 held-out observations. We make 10 total mistakes: 5 FP and 5 FN.

## Other common performance measures

There are many other performance measures people look at. Here we compute a few more:



*   Precision = TP/(TP + FP). Precision is the ratio of true positives to predicted positives (the total number of times the class was predicted). Here we're really focusing on false positives: Precision will be high when there are few false positives. You would care about this metric in situations where false positives are relatively costly, e.g. spam detection or fraud detection.
*   Recall = TP/(TP + FN). Recall, also called *sensitivity*, is the ratio of true positive predictions to actual positive cases. Here we're really focusing on false negatives: Recall will be high when there are few false negatives. You would care about this metric in situations where false negatives are relatively costly, e.g. medical screening or churn prediction.
*   F1 = 2TP/(2TP + FP + FN). F1 is the harmonic mean of precision and recall, so it tries to balance false negatives and false positives.
*   Accuracy = (TP + TN)/N. Fraction (sometimes reported as number) of correct predictions.
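
The formulas can be checked with a few lines of arithmetic. This is a sketch under an assumption: the counts below are inferred from figures quoted in this notebook (200 validation observations, 5 FP, 5 FN, class sizes of 102 and 98), and treating the class with 98 observations as the positive class is a guess rather than something read off the output.

```python
# Assumed counts, inferred from the figures quoted in this notebook
tp, fp, tn, fn = 93, 5, 97, 5
n = tp + fp + tn + fn

precision = tp / (tp + fp)         # share of predicted positives that are correct
recall = tp / (tp + fn)            # share of actual positives that are found
f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
accuracy = (tp + tn) / n           # share of all predictions that are correct

print(f'precision={precision:.5f} recall={recall:.5f} '
      f'f1={f1:.5f} accuracy={accuracy:.3f}')
```

With equal numbers of false positives and false negatives, precision, recall, and F1 coincide because their denominators are the same.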

In [None]:
# Compute classification metrics for the logistic regression model
logistic_classification_metrics = classification_report(y_validation, y_pred_logistic, output_dict=True)
pd.DataFrame(logistic_classification_metrics)

In our toy example, we end up with identical precision, recall, and f1. This is a fluke. It just happens that the true and false predictions line up in exactly the right way.

In the report, we have two additional columns *macro avg* and *weighted avg*.



*   macro avg: Reports the average of the associated performance measure across classes. E.g. in our example, macro avg for precision = (0.95098+0.94898)/2 (where 2 is the number of classes we are averaging over).
*   weighted avg: Reports the average of the associated performance measure across classes weighted by class size. E.g. in our example, weighted avg for precision = (102/200)$*$0.95098+(98/200)$*$0.94898
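
The two averages can be reproduced directly from the per-class precisions and class sizes quoted in the bullets. This is a minimal sketch: the values come from the text above, and the dictionary keys are placeholder labels.

```python
# Per-class precision and class sizes as quoted above; keys are placeholders
precision_by_class = {'first class': 0.95098, 'second class': 0.94898}
support = {'first class': 102, 'second class': 98}

n_classes = len(precision_by_class)
n_obs = sum(support.values())

# Macro average: every class counts equally
macro_avg = sum(precision_by_class.values()) / n_classes

# Weighted average: every class counts in proportion to its size
weighted_avg = sum(precision_by_class[c] * support[c] / n_obs
                   for c in precision_by_class)

print(f'macro avg = {macro_avg:.5f}, weighted avg = {weighted_avg:.5f}')
```

The two averages are close here because the classes are nearly the same size. They can differ substantially when classes are imbalanced.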



## ROC Curve and AUC

The *ROC curve* is another commonly provided summary of prediction accuracy. (ROC stands for *receiver operating characteristic* which is not particularly evocative as a name).

ROC plots the TP rate = TP/(TP + FN) against the FP rate = FP/(FP + TN), where TP, FP, TN, and FN are calculated at different "decision thresholds": We consider classifying each observation according to $\hat{p}_i > \textrm{decision threshold}$, varying the decision threshold from 0 to 1.

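The left-hand side of the curve corresponds to a decision threshold of 1 (nothing is classified as positive, so both rates are 0) and the right-hand side to a decision threshold of 0 (everything is classified as positive, so both rates are 1).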

The dashed line is what you get from "random guessing".

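*Area under the curve* (AUC) provides a summary of the ROC curve, with numbers closer to 1 indicating better performance. An AUC of one would be a perfect ranking of predictions: every actual positive receives a higher predicted probability than every actual negative, so some threshold gives a 100% TP rate with a 0% FP rate. An AUC of 0.5 corresponds to random guessing.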
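
The threshold sweep can be sketched by hand on toy data, using TP rate = TP/(TP + FN) and FP rate = FP/(FP + TN). The snippet also computes AUC through its ranking interpretation: the share of (positive, negative) pairs in which the positive has the higher predicted probability, with ties counted as one half. The helper names and the toy data are hypothetical. In the notebook itself, `roc_curve` and `roc_auc_score` do this work.

```python
def roc_point(y_true, probs, threshold):
    """Return (FP rate, TP rate) when predicting 1 if prob > threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p > threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p > threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def auc_by_ranking(y_true, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Toy outcomes and predicted probabilities, for illustration only
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for threshold in (1.0, 0.5, 0.0):
    print(threshold, roc_point(y_true, probs, threshold))

auc = auc_by_ranking(y_true, probs)
print('AUC:', auc)
```

A threshold of 1 produces the bottom-left corner of the plot and a threshold of 0 produces the top-right corner.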

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_validation, y_pred_prob_logistic)
roc_auc_logistic = roc_auc_score(y_validation, y_pred_prob_logistic)

plt.plot(fpr, tpr, label=f'Logistic Regression (area = {roc_auc_logistic:.2f})')
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()


## Cumulative Gain and Lift

The last two classification performance summaries we'll look at are *cumulative gain* and *lift*.

**Cumulative gain**

Cumulative gain considers classification performance when you are interested in targeting x\% of the "population", starting with the individuals who have the highest predicted probabilities. The cumulative gain chart plots the fraction of individuals targeted on the x-axis against the fraction of all individuals truly belonging to the "1" group who are captured among those targeted (the true positive rate) on the y-axis.

Cumulative gain provides a quick visual of how good our classifier is at targeting our outcome.
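
Here is a minimal sketch of the calculation on ten made-up observations: sort by predicted probability from highest to lowest, then track what share of all actual positives has been captured after each additional individual is targeted. The toy data are hypothetical.

```python
# Toy outcomes and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
probs = [0.95, 0.40, 0.85, 0.70, 0.60, 0.30, 0.20, 0.15, 0.10, 0.05]

# Rank individuals from highest to lowest predicted probability
order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
sorted_true = [y_true[i] for i in order]

total_positives = sum(y_true)
cumulative_gain = []
captured = 0
for outcome in sorted_true:
    captured += outcome
    cumulative_gain.append(captured / total_positives)

# Fraction of the population targeted at each step
fraction_targeted = [(k + 1) / len(y_true) for k in range(len(y_true))]

for share, gain in zip(fraction_targeted, cumulative_gain):
    print(f'target top {share:.0%}: capture {gain:.0%} of positives')
```

In this toy example, targeting the top 30% of individuals captures 75% of the positives, whereas targeting 30% at random would be expected to capture only 30%.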

In [None]:
# Creating a DataFrame with the true values and predicted probabilities
# Note: this reuses the name `data`, so the original customer DataFrame is overwritten
data = pd.DataFrame({'true': y_validation, 'prob': y_pred_prob_logistic})
data.sort_values(by='prob', ascending=False, inplace=True)

# Calculating cumulative gain
data['cumulative_gain'] = np.cumsum(data['true']) / data['true'].sum()
data['cumulative_percentage'] = np.arange(1, len(data) + 1) / len(data)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(data['cumulative_percentage'], data['cumulative_gain'], label='Cumulative Gain')
plt.plot([0, 1], [0, 1], 'r--', label='Baseline')
plt.xlabel('Percentage of samples')
plt.ylabel('Cumulative gain')
plt.title('Cumulative Gain Chart')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

**Lift**

Lift is essentially just a different representation of cumulative gain. In the lift chart, we are looking at the ratio of results obtained with and without the predictive model.

The lift chart is essentially telling us how much more likely we are to see positive responses by targeting people according to our model relative to targeting people at random.
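
As a minimal sketch, lift at each targeting depth is the cumulative gain divided by the fraction of the population targeted. The helper name `lift_curve` and the toy data are hypothetical.

```python
def lift_curve(y_true, probs):
    """Return (fraction targeted, lift) pairs, targeting highest probs first."""
    n = len(y_true)
    total_positives = sum(y_true)
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    curve = []
    captured = 0
    for k, i in enumerate(order, start=1):
        captured += y_true[i]
        gain = captured / total_positives   # cumulative gain at depth k
        share = k / n                       # fraction of population targeted
        curve.append((share, gain / share))
    return curve

# Toy outcomes and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
probs = [0.95, 0.40, 0.85, 0.70, 0.60, 0.30, 0.20, 0.15, 0.10, 0.05]

curve = lift_curve(y_true, probs)
for share, lift in curve:
    print(f'top {share:.0%}: lift = {lift:.2f}')
```

A lift of 2.5 at the top 30% means the model-ranked group contains 2.5 times as many positives as a randomly chosen group of the same size would be expected to contain. Lift always ends at 1 once the whole population is targeted.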

In [None]:
# Calculating lift
data['lift'] = data['cumulative_gain'] / data['cumulative_percentage']

# Plotting lift curve
plt.figure(figsize=(10, 6))
plt.plot(data['cumulative_percentage'], data['lift'], label='Lift Curve')
plt.plot([0, 1], [1, 1], 'r--')
plt.xlabel('Percentage of samples')
plt.ylabel('Lift')
plt.title('Lift Curve')
plt.legend(loc='upper right')
plt.grid(True)
plt.show()

# Classification Tree

Just like we can use trees for regression, we can also use them for classification. The basic ideas are the same.

In [None]:
# Now let's try a classification tree

# Select features and target variable for training
features = ['tenure', 'monthly_charges', 'total_charges', 'contract',
            'payment_method', 'topic', 'polarity', 'sentiment']

# Get outcome and feature data in train and validation data including dummies
# for our categorical variables
X_train = pd.get_dummies(train_data[features], dtype=float)
y_train = train_data['churn']

X_validation = pd.get_dummies(validation_data[features], dtype=float)
y_validation = validation_data['churn']

# Align validation data to ensure same columns as training data
X_validation = X_validation.reindex(columns=X_train.columns, fill_value=0)

# Cross-validate using only training data to get "best" tree
cvsplit = KFold(n_splits=5, shuffle=True, random_state=729)

# Parameter we want to choose based on cross-validation performance - number of leaves
parameters = {'max_leaf_nodes':range(2,51)}

# Define model and do cross-validation
tree = DecisionTreeClassifier()
cv_tree = GridSearchCV(tree, parameters, scoring='accuracy', refit=True, cv=cvsplit)
# We can evaluate our performance based on many different measures. We're using
# accuracy in this example. The commented line below uses recall instead. We might
# care more about recall in this example than overall accuracy as false negatives,
# incorrectly classifying someone as staying when they are going to leave, may
# be more costly than false positives in this example.
#cv_tree = GridSearchCV(tree, parameters, scoring='recall', refit=True, cv=cvsplit)

# Perform cross validation
cv_tree.fit(X_train, y_train)

# Pull out and plot the tree corresponding to the best prediction rule
# according to CV.
best_tree = cv_tree.best_estimator_

plot_tree(best_tree, feature_names = X_train.columns)
plt.show()

Our tree-based classification rule looks extremely simple. It is clearly very interpretable.

Let's see what the cross-validated performance looks like with different numbers of leaves.

In [None]:
# Number of leaves tried in the grid search
leaves = cv_tree.cv_results_.get('param_max_leaf_nodes')
leaves = leaves.tolist()

# mean_test_score is the scoring metric (accuracy here), so higher is better
cv_accuracy = cv_tree.cv_results_.get('mean_test_score')

plt.plot(leaves, cv_accuracy, label = 'Mean CV accuracy')
plt.axvline(cv_tree.best_params_.get('max_leaf_nodes'),
            linestyle="--", color="black", label="CV estimate")
plt.xlabel("Number of leaves")
plt.ylabel("Cross-validation Performance")
plt.legend()
plt.show()


## Classification performance of tree

Let's see how well the tree does in classifying our held out observations.

In [None]:
# Make predictions on the validation set
y_pred_tree = best_tree.predict(X_validation)
y_pred_prob_tree = best_tree.predict_proba(X_validation)[:, 1]

In [None]:
# Display confusion matrix
ConfusionMatrixDisplay.from_predictions(y_validation, y_pred_tree)
plt.show()

This is identical to what we saw for logistic regression. It then follows that recall, precision, and f1 will also be the same.

In [None]:
# Compute classification metrics for the classification tree model
tree_classification_metrics = classification_report(y_validation, y_pred_tree, output_dict=True)
pd.DataFrame(tree_classification_metrics)

We can compare the tree model to the logistic model in terms of ROC and lift as well.

In [None]:
# ROC Curve
fprtree, tprtree, thresholdstree = roc_curve(y_validation, y_pred_prob_tree)
roc_auc_tree = roc_auc_score(y_validation, y_pred_prob_tree)

plt.plot(fpr, tpr, label=f'Logistic Regression (area = {roc_auc_logistic:.2f})')
plt.plot(fprtree, tprtree, label=f'Tree (area = {roc_auc_tree:.2f})')
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()


In [None]:
# Creating a DataFrame with the true values and predicted probabilities
datatr = pd.DataFrame({'true': y_validation, 'prob': y_pred_prob_tree})
datatr.sort_values(by='prob', ascending=False, inplace=True)

# Calculating cumulative gain
datatr['cumulative_gain'] = np.cumsum(datatr['true']) / datatr['true'].sum()
datatr['cumulative_percentage'] = np.arange(1, len(datatr) + 1) / len(datatr)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(data['cumulative_percentage'], data['cumulative_gain'], label='Cumulative Gain - Logistic')
plt.plot(datatr['cumulative_percentage'], datatr['cumulative_gain'],
         label='Cumulative Gain - Tree')
plt.plot([0, 1], [0, 1], 'r--', label='Baseline')
plt.xlabel('Percentage of samples')
plt.ylabel('Cumulative gain')
plt.title('Cumulative Gain Chart')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

In [None]:
# Calculating lift
datatr['lift'] = datatr['cumulative_gain'] / datatr['cumulative_percentage']

# Plotting lift curve
plt.figure(figsize=(10, 6))
plt.plot(data['cumulative_percentage'], data['lift'], label='Lift Curve - Logistic')
plt.plot(datatr['cumulative_percentage'], datatr['lift'], label='Lift Curve - Tree')
plt.plot([0, 1], [1, 1], 'r--')
plt.xlabel('Percentage of samples')
plt.ylabel('Lift')
plt.title('Lift Curve')
plt.legend(loc='upper right')
plt.grid(True)
plt.show()

The logistic prediction rule and the tree prediction rule give essentially identical performance in our validation data!