## Preprocessing and Modeling

Recall that the data we are working with are complaints filed with the Consumer Financial Protection Bureau in 2022. Our target variable is 'Timely response?'. We will need to perform NLP on 'Consumer complaint narrative'. Our top categorical features based on chi-square analysis are 'Product', 'Sub-product', and 'Company', which are self-explanatory.
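
The chi-square analysis itself lives in the previous notebook and is not reproduced here. As a rough reminder of what such a ranking can look like, here is a minimal sketch on made-up rows; the helper name `chi_square_score` and the toy values are assumptions for illustration, not the code that produced the ranking above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for the complaints table (values are made up for illustration).
toy = pd.DataFrame({
    "Product": ["Mortgage", "Mortgage", "Credit card", "Credit card",
                "Debt collection", "Debt collection", "Mortgage", "Credit card"],
    "Timely response?": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"],
})

def chi_square_score(frame, feature, target="Timely response?"):
    """Hypothetical helper: chi-square statistic, p-value and dof for feature vs. target."""
    contingency = pd.crosstab(frame[feature], frame[target])
    stat, p_value, dof, _expected = chi2_contingency(contingency)
    return stat, p_value, dof

stat, p_value, dof = chi_square_score(toy, "Product")
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")
```

A larger statistic (smaller p-value) suggests a stronger association between the feature and the target, which is one way to rank categorical columns.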

In [2]:
#import packages
import re
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, FunctionTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.sparse import hstack

# imblearn's make_pipeline (not sklearn's) so that samplers such as
# RandomUnderSampler can be used as pipeline steps
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

In [3]:
#From the previous notebook, we dropped irrelevant columns
df = pd.read_csv('complaints-2024-07-16_13_58.csv')
columns_to_drop=['Date received','Company public response','State','ZIP code','Tags','Consumer consent provided?','Submitted via','Date sent to company','Company response to consumer','Consumer disputed?','Complaint ID']
df = df.drop(columns_to_drop,axis=1)

In [14]:
df.columns

Index(['Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company', 'Timely response?'],
      dtype='object')

Let us first start with a rudimentary decision tree model that just uses the features 'Company' and 'Consumer complaint narrative' to predict 'Timely response?'.

For the features we need:
- to handle the redacted text in consumer complaints. I opted at first to replace the X's with '\<REDACTED>'; later the X's will be removed instead.
- to use an encoder for the companies. I opted for ordinal encoding because 2k+ unique companies seemed like too many for one-hot encoding.
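
To make the two redaction strategies concrete, here is a small sketch of the same regex applied to a made-up narrative. The function name `redact` and the sample sentence are assumptions for illustration only.

```python
import re

def redact(text, strategy="replace"):
    """Any run of two or more X's is treated as redacted text."""
    if strategy == "replace":
        return re.sub(r"XX+", "<REDACTED>", text)
    if strategy == "remove":
        return re.sub(r"XX+", "", text)
    return text

sample = "On XX/XX/2022 I called XXXX about my account."  # made-up narrative
replaced = redact(sample, "replace")
removed = redact(sample, "remove")
print(replaced)  # On <REDACTED>/<REDACTED>/2022 I called <REDACTED> about my account.
print(removed)   # On //2022 I called  about my account.
```

Note that partially redacted dates keep their slashes and year under both strategies, so they survive as fragments rather than disappearing entirely.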

We will look at the classification report. 

In [27]:
# Select relevant columns
features = df.drop(['Timely response?'],axis=1)
target = df['Timely response?'].apply(lambda x: 1 if x == 'Yes' else 0)

# Function to handle redacted information
def handle_redacted(text, strategy="replace"):
    if strategy == "replace":
        return re.sub(r'XX+', '<REDACTED>', text)
    elif strategy == "remove":
        return re.sub(r'XX+', '', text)
    else:
        return text

# Apply the redacted handling function
features['Consumer complaint narrative'] = features['Consumer complaint narrative'].apply(lambda x: handle_redacted(x, strategy="replace"))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# TF-IDF Vectorization for complaint narratives
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train['Consumer complaint narrative'])
X_test_tfidf = vectorizer.transform(X_test['Consumer complaint narrative'])

# Ordinal Encoding for companies
# Combine training and testing data for encoding
combined_data = pd.concat([X_train[['Company']], X_test[['Company']]])

# Initialize OrdinalEncoder with handle_unknown='use_encoded_value' (optional)
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit and transform on combined data
combined_encoded = ordinal_encoder.fit_transform(combined_data)

# Split back into training and testing sets
X_train_companies_encoded = combined_encoded[:len(X_train)]
X_test_companies_encoded = combined_encoded[len(X_train):]

# Combine TF-IDF vectors and company encodings
# (the encoder output is already a 2-D column, so no reshape is needed)
X_train_combined = hstack([X_train_tfidf, X_train_companies_encoded])
X_test_combined = hstack([X_test_tfidf, X_test_companies_encoded])

# Train the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_combined, y_train)

# Predict and evaluate
y_pred = clf.predict_proba(X_test_combined)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc:.4f}")

AUC: 0.5478
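
In the cell above the encoder is fit on the training and test companies together, so `handle_unknown` never comes into play. A possible alternative, sketched below on made-up company names, is to fit on the training split only and let unseen companies fall back to `-1`. This is an illustration of the encoder's behavior, not a change that was evaluated in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up company names; "Gamma LLC" appears only in the test split.
train = pd.DataFrame({"Company": ["Acme Bank", "Beta Credit", "Acme Bank"]})
test = pd.DataFrame({"Company": ["Beta Credit", "Gamma LLC"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["Company"]])  # learn categories from train only
test_codes = encoder.transform(test[["Company"]])        # unseen companies map to -1

print(train_codes.ravel().tolist())  # [0.0, 1.0, 0.0]
print(test_codes.ravel().tolist())   # [1.0, -1.0]
```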


In [44]:
y_pred_binary = (y_pred >= 0.5).astype(int)

print(classification_report(y_test, y_pred_binary))

              precision    recall  f1-score   support

           0       0.11      0.10      0.10       533
           1       0.99      0.99      0.99     65110

    accuracy                           0.99     65643
   macro avg       0.55      0.55      0.55     65643
weighted avg       0.99      0.99      0.99     65643



Although the accuracy is around 99%, recall that our data is heavily imbalanced: only 533 of the 65,643 test rows (under 1%) are negative, and the tree recovers just 10% of them. Accuracy is therefore not a good metric for our models. Instead we will look at AUC. Let's compare with two other models: Random Forest and Logistic Regression.
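
A quick way to see why accuracy misleads here is to score a "model" that always predicts the positive class. The sketch below uses only the class counts from the report above; the constant predictor is a hypothetical baseline, not one of the models in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts copied from the classification report: 533 negatives, 65,110 positives.
y_true = np.concatenate([np.zeros(533, dtype=int), np.ones(65110, dtype=int)])

# A baseline that ignores its input: same label and same score for every row.
constant_labels = np.ones_like(y_true)
constant_scores = np.ones(len(y_true), dtype=float)

baseline_accuracy = accuracy_score(y_true, constant_labels)
baseline_auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy: {baseline_accuracy:.4f}")  # accuracy: 0.9919
print(f"AUC:      {baseline_auc:.4f}")       # AUC:      0.5000
```

The baseline matches the tree's 99% accuracy without learning anything, while its AUC sits at the chance level of 0.5.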

In [13]:
# Select relevant columns
features = df.drop(['Timely response?'],axis=1)
target = df['Timely response?'].apply(lambda x: 1 if x == 'Yes' else 0)

# Function to handle redacted information
def handle_redacted(text, strategy="replace"):
    if strategy == "replace":
        return re.sub(r'XX+', '<REDACTED>', text)
    elif strategy == "remove":
        return re.sub(r'XX+', '', text)
    else:
        return text

# Apply the redacted handling function
features['Consumer complaint narrative'] = features['Consumer complaint narrative'].apply(lambda x: handle_redacted(x, strategy="replace"))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# TF-IDF Vectorization for complaint narratives
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train['Consumer complaint narrative'])
X_test_tfidf = vectorizer.transform(X_test['Consumer complaint narrative'])

# Ordinal Encoding for companies
# Combine training and testing data for encoding
combined_data = pd.concat([X_train[['Company']], X_test[['Company']]])

# Initialize OrdinalEncoder with handle_unknown='use_encoded_value' (optional)
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit and transform on combined data
combined_encoded = ordinal_encoder.fit_transform(combined_data)

# Split back into training and testing sets
X_train_companies_encoded = combined_encoded[:len(X_train)]
X_test_companies_encoded = combined_encoded[len(X_train):]

# Combine TF-IDF vectors and company encodings
# (the encoder output is already a 2-D column, so no reshape is needed)
X_train_combined = hstack([X_train_tfidf, X_train_companies_encoded])
X_test_combined = hstack([X_test_tfidf, X_test_companies_encoded])

# Train the Random Forest model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_combined, y_train)

# Predict and evaluate
y_pred = clf.predict_proba(X_test_combined)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc:.4f}")

AUC: 0.7728


In [14]:
# Select relevant columns
features = df.drop(['Timely response?'],axis=1)
target = df['Timely response?'].apply(lambda x: 1 if x == 'Yes' else 0)

# Function to handle redacted information
def handle_redacted(text, strategy="replace"):
    if strategy == "replace":
        return re.sub(r'XX+', '<REDACTED>', text)
    elif strategy == "remove":
        return re.sub(r'XX+', '', text)
    else:
        return text

# Apply the redacted handling function
features['Consumer complaint narrative'] = features['Consumer complaint narrative'].apply(lambda x: handle_redacted(x, strategy="replace"))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# TF-IDF Vectorization for complaint narratives
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train['Consumer complaint narrative'])
X_test_tfidf = vectorizer.transform(X_test['Consumer complaint narrative'])

# Ordinal Encoding for companies
# Combine training and testing data for encoding
combined_data = pd.concat([X_train[['Company']], X_test[['Company']]])

# Initialize OrdinalEncoder with handle_unknown='use_encoded_value' (optional)
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit and transform on combined data
combined_encoded = ordinal_encoder.fit_transform(combined_data)

# Split back into training and testing sets
X_train_companies_encoded = combined_encoded[:len(X_train)]
X_test_companies_encoded = combined_encoded[len(X_train):]

# Combine TF-IDF vectors and company encodings
# (the encoder output is already a 2-D column, so no reshape is needed)
X_train_combined = hstack([X_train_tfidf, X_train_companies_encoded])
X_test_combined = hstack([X_test_tfidf, X_test_companies_encoded])

# Train the Logistic Regression model
clf = LogisticRegression(random_state=42)
clf.fit(X_train_combined, y_train)

# Predict and evaluate
y_pred = clf.predict_proba(X_test_combined)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc:.4f}")

AUC: 0.7997


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


So far the AUC scores are:
- DecisionTreeClassifier: 0.5478
- RandomForestClassifier: 0.7728
- LogisticRegression: 0.7997

Default Logistic Regression provides the best AUC, even though its solver stopped at the iteration limit, and the decision tree is the worst. From now on we will compare Logistic Regression with Random Forest.
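
The three cells above differ only in the estimator, so the comparison could also be written as a loop. The sketch below shows the idea on synthetic data from `make_classification`, which is a stand-in and not the CFPB data. The helper name `compare_auc`, the stratified split, and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the TF-IDF + company matrix.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.05, 0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=42)

candidates = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(random_state=42, max_iter=1000),
}

def compare_auc(models, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit each model and return {name: test AUC}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return scores

results = compare_auc(candidates, X_tr, y_tr, X_te, y_te)
for name, auc in results.items():
    print(f"{name}: {auc:.4f}")
```

The numbers printed by this sketch say nothing about the complaint data; only the structure carries over.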

For the following models, I made these changes:
- features are now
    - 'Company': ordinal encoding
    - 'Product': ordinal encoding
    - 'Sub-product': ordinal encoding, with the 4 null values filled with 'Unknown'
    - 'Consumer complaint narrative': remove the X's and handle dates, then apply TfidfVectorizer
- models will use the parameter <code>class_weight='balanced'</code> to account for the imbalanced data
- pipelines will be implemented for ease of use
    - a grid search over hyperparameters, combined with undersampling, will be attempted with Logistic Regression; the default parameters will also be tried in case the added model complexity is detrimental on new, unseen data.
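
To give a sense of what `class_weight='balanced'` does, the sketch below computes the balanced weights for the class counts seen in the earlier classification report. Those counts come from the test split and are used here only for scale; the training split will give slightly different values.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the earlier classification report (test split).
y = np.concatenate([np.zeros(533, dtype=int), np.ones(65110, dtype=int)])

# 'balanced' uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

for label, weight in zip([0, 1], weights):
    print(f"class {label}: weight {weight:.3f}")
```

Each untimely complaint ends up weighted roughly a hundred times more than a timely one, so mistakes on the rare class cost far more during training.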

In [15]:
# Function to handle redacted information
def handle_redacted(text, strategy="replace"):
    if not isinstance(text, str):
        return text  # Return as is if not a string
    if strategy == "replace":
        text = re.sub(r'XX+', '<REDACTED>', text)
    elif strategy == "remove":
        text = re.sub(r'XX+', '', text)
    # Handle dates
    text = re.sub(r'\b\d{2}/\d{2}/\d{4}\b', '<DATE>', text)
    return text

# Custom transformer for handling redacted and sensitive information
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, strategy="replace"):
        self.strategy = strategy

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.apply(lambda text: handle_redacted(text, self.strategy))

# Select relevant columns
# .copy() gives an independent DataFrame, so the fillna below does not trigger SettingWithCopyWarning
features = df[['Company', 'Product' ,'Sub-product', 'Consumer complaint narrative']].copy()
target = df['Timely response?'].apply(lambda x: 1 if x == 'Yes' else 0)
features['Sub-product'] = features['Sub-product'].fillna('Unknown')

# Define text preprocessing pipeline
text_pipeline = Pipeline([
    ('redacted_handling', TextPreprocessor(strategy="remove")),
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000))
])

# Define column transformer for categorical features
categorical_features = ['Company', 'Product' ,'Sub-product']
categorical_pipeline = Pipeline([
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

# Combine text and categorical features
preprocessor = ColumnTransformer([
    ('text', text_pipeline, 'Consumer complaint narrative'),
    ('categorical', categorical_pipeline, categorical_features)
])

# Define the full pipeline with undersampling
pipeline_lr = make_pipeline(
    preprocessor,
    RandomUnderSampler(random_state=42),
    LogisticRegression(random_state=42, class_weight='balanced', max_iter=100000)
)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'logisticregression__C': [0.1, 1.0, 10.0],
    'logisticregression__solver': ['liblinear', 'lbfgs']
}

# Perform grid search
grid_search_lr = GridSearchCV(pipeline_lr, param_grid, cv=5, scoring='roc_auc')

# Fit Grid Search
grid_search_lr.fit(X_train, y_train)

# Predict with Logistic Regression
y_pred_lr = grid_search_lr.predict_proba(X_test)[:, 1]
auc_lr = roc_auc_score(y_test, y_pred_lr)
print(f"Logistic Regression AUC: {auc_lr:.4f}")

# Print best parameters
print(f"Best parameters: {grid_search_lr.best_params_}")

Logistic Regression AUC: 0.8469
Best parameters: {'logisticregression__C': 1.0, 'logisticregression__solver': 'liblinear'}
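
The `logisticregression__` prefix in the grid keys comes from how `make_pipeline` names its steps: each step gets the lowercased class name of its estimator. The sketch below shows this with scikit-learn's own `make_pipeline` and a placeholder `StandardScaler` step; as far as I know, the imblearn version used in this notebook follows the same naming convention.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler is only a placeholder first step for this illustration.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# Grid keys follow the pattern "<step name>__<parameter name>".
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
valid_keys = pipe.get_params().keys()
keys_ok = all(key in valid_keys for key in param_grid)
print(keys_ok)  # True
```

Checking the keys against `get_params()` before a long grid search is a cheap way to catch a misspelled step or parameter name.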


In [10]:
# Plain Logistic Regression pipeline: same preprocessing, no undersampling or grid search
pipeline_lr = make_pipeline(preprocessor, LogisticRegression(random_state=42, class_weight='balanced', max_iter=100000))

# Train the Logistic Regression pipeline
pipeline_lr.fit(X_train, y_train)

# Predict with Logistic Regression
y_pred_lr = pipeline_lr.predict_proba(X_test)[:, 1]
auc_lr = roc_auc_score(y_test, y_pred_lr)
print(f"Logistic Regression AUC: {auc_lr:.4f}")

Logistic Regression AUC: 0.8524


In [11]:
from sklearn.ensemble import RandomForestClassifier

# Update pipeline to use Random Forest
pipeline_rf = make_pipeline(preprocessor, RandomForestClassifier(random_state=42, class_weight='balanced'))

# Train the Random Forest pipeline
pipeline_rf.fit(X_train, y_train)

# Predict with Random Forest
y_pred_rf = pipeline_rf.predict_proba(X_test)[:, 1]
auc_rf = roc_auc_score(y_test, y_pred_rf)
print(f"Random Forest AUC: {auc_rf:.4f}")

Random Forest AUC: 0.8083


## Conclusion

The AUC scores are summarized below:

| model (class_weight='balanced')                         | AUC    |
|---------------------------------------------------------|--------|
| Logistic Regression with grid search and under sampling | 0.8469 |
| Logistic Regression                                     | 0.8524 |
| Random Forest                                           | 0.8083 |



Recall that the Area Under the ROC curve is a value between 0 and 1 that summarizes the ROC curve (plot of the true positive rate against the false positive rate), where 0.5 AUC represents a model that is random.
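
An equivalent way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. The sketch below computes it that way on made-up labels and scores; the function name `pairwise_auc` is an assumption for illustration.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p, n in product(positives, negatives):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores: 5 of the 6 positive/negative pairs are ordered correctly.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
ranked_auc = pairwise_auc(y_true, scores)
constant_auc = pairwise_auc(y_true, [0.5] * 5)
print(round(ranked_auc, 4))  # 0.8333
print(constant_auc)          # 0.5
```

A model that gives every complaint the same score lands at 0.5, which is the random baseline mentioned above.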

Logistic Regression is the best-performing model. There is a trade-off between a more complex model, which may overfit, and a simpler model, which may not capture all important features. Here the simple Logistic Regression model with default parameters (aside from the balanced class weight and a higher iteration limit) is the best choice: it scores 0.8524, against 0.8469 for the grid-searched, undersampled variant.

## Further Steps

The models can be applied to other years of complaints filed with the Consumer Financial Protection Bureau. Complaints with narratives account for only 42% of all complaints in 2022, so we may want to consider the remaining 58% as well. Doing so could point to particular companies or products that are more likely to have untimely responses, although this may not require machine learning, since there is no narrative text to process for those complaints.
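
For complaints without a narrative, a plain aggregation may already be informative. The sketch below ranks companies by their untimely-response rate on made-up rows; the column `untimely` and the toy values are assumptions for illustration, and the same grouping would work with 'Product'.

```python
import pandas as pd

# Made-up rows standing in for complaints with or without a narrative.
toy = pd.DataFrame({
    "Company": ["Acme Bank", "Acme Bank", "Beta Credit", "Beta Credit", "Beta Credit"],
    "Product": ["Mortgage", "Mortgage", "Credit card", "Credit card", "Mortgage"],
    "Timely response?": ["Yes", "No", "Yes", "Yes", "Yes"],
})

toy["untimely"] = (toy["Timely response?"] == "No").astype(int)

# Untimely-response rate and complaint volume per company, worst first.
by_company = (
    toy.groupby("Company")["untimely"]
       .agg(untimely_rate="mean", complaints="size")
       .sort_values("untimely_rate", ascending=False)
)
print(by_company)
```

Keeping the complaint count next to the rate helps avoid over-reading companies that have only a handful of complaints.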

Outside this dataset, we can look at whether or not the dates could be related to why some companies did not offer a timely response. Perhaps some companies were going through bankruptcy and so did not have consumer complaints as a priority. 