Hello, this is our Data Science Studio Final Project. Today, we will analyze the DCWP Consumer Complaints dataset and build multiple classification models with it. Consumer complaints are an important signal for identifying patterns of dissatisfaction, fraud, or misconduct among businesses. By analyzing consumer complaints, agencies and companies can improve customer service, ensure compliance with regulations, and protect consumers from unfair practices.

### The aim of this analysis

Today, we will assume the role of data analysts working for a consumer protection agency. We need to analyze the DCWP Consumer Complaints dataset and develop classification models that can accurately predict the status of a consumer complaint based on factors such as the type of business, type of complaint, and submission method. We will also evaluate the models to see which one performs best. Some of the questions we aim to answer are:
- Which features are most important in determining the final status of a consumer complaint?
- Can we identify any patterns or trends in the types of complaints received across different business categories?
- Can we build a predictive model that can accurately classify the complaint status based on the available information?

### Introducing the Dataset

The dataset we will be using for this project is the DCWP Consumer Complaints Dataset, which we retrieved from the NYC Open Data Portal through its API. The Department of Consumer and Worker Protection (DCWP) records complaints filed by consumers against businesses operating in New York City. This dataset provides valuable insights into consumer issues and business compliance across various industries.

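The features we will be using in this project are:
- `business_category`: the type of business the complaint was filed against
- `complaint_code`: the type of complaint
- `resolution_days`: the number of days between `intake_date` and `result_date` (engineered during pre-processing)

The target variable is derived from the `result` column.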

Let's start by exploring the dataset!

In [None]:
# Importing necessary libraries

from sodapy import Socrata
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier


In [None]:
# Creating a client object to make API requests

client = Socrata("data.cityofnewyork.us", None)



In [None]:
# The ID of the dataset

dataset_id = "nre2-6m2s"

In [None]:
# Getting the data

results = client.get(dataset_id, limit = 5000)

In [None]:
# Putting it into a dataframe

df = pd.DataFrame.from_records(results)

In [148]:
df.head()

| | record_id | intake_date | intake_channel | _311_sr_number | business_category | complaint_code | business_unique_id | business_name | result_date | result | ... | census_block_2010_ | census_tract_2010_ | latitude | longitude | street2 | apt_suite | unit_type | refund_amount | street3 | contract_cancelled_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 057329-2025-CMPL | 2025-02-24T00:00:00.000 | 311 | 311-22038149 | Restaurant | Price Gouging | BA-1722078-2025 | pateizia restaurant | 2025-02-24T00:00:00.000 | Referred | ... | 8002 | 66 | 40.73991700211976 | -73.97939843013278 | | | | | | |
| 1 | 057324-2025-CMPL | 2025-02-24T00:00:00.000 | 311 | 311-22036709 | Supermarket | Overcharge | BA-1722116-2025 | A & Y Embassy Food Corp. | 2025-02-24T00:00:00.000 | Referred | ... | 3005 | 539 | 40.707436555829005 | -73.9154949964182 | | | | | | |
| 2 | 057319-2025-CMPL | 2025-02-24T00:00:00.000 | 311 | 311-22036401 | Misc Non-Food Retail | Non-Delivery of Goods - N01 | BA-1722067-2025 | BURGER KING | 2025-02-24T00:00:00.000 | Complaint Review Complete | ... | 1006 | 109 | 40.74997771919704 | -73.98792375849172 | | | | | | |
| 3 | 057305-2025-CMPL | 2025-02-23T00:00:00.000 | 311 | 311-22034563 | Supermarket | Non-Delivery of Goods - N01 | BA-1722054-2025 | OCEAN BAY MARKET INC. | 2025-02-24T00:00:00.000 | Referred | ... | 1000 | 392 | 40.59765876127 | -73.9611496265785 | | | | | | |
| 4 | 057273-2025-CMPL | 2025-02-23T00:00:00.000 | 311 | 311-22025951 | Dry Cleaners | Lost/Stolen/Damaged Property | BA-1722025-2025 | Vital tailor shop | 2025-02-24T00:00:00.000 | Insufficient Info Received | ... | 2001 | 101 | 40.65667828908096 | -74.00194621940757 | | | | | | |


In [None]:
# Checking the initial shape of the dataset

df.shape

(5000, 33)

In [None]:
# The columns

df.columns.tolist()

['record_id',
 'intake_date',
 'intake_channel',
 '_311_sr_number',
 'business_category',
 'complaint_code',
 'business_unique_id',
 'business_name',
 'result_date',
 'result',
 'referred_to',
 'address_type',
 'building_nbr',
 'street1',
 'city',
 'state',
 'postcode',
 'borough',
 'community_board',
 'council_district',
 'bin',
 'bbl',
 'nta',
 'census_block_2010_',
 'census_tract_2010_',
 'latitude',
 'longitude',
 'street2',
 'apt_suite',
 'unit_type',
 'refund_amount',
 'street3',
 'contract_cancelled_amount']

### Pre-processing the Data

Before building a classification model, it is important to preprocess the data carefully to ensure it is clean, consistent, and suitable for the algorithms we will be using in this project.

We will preprocess the data by handling missing values, dropping irrelevant features, simplifying the target variable, encoding categorical variables, and doing feature engineering!

In [151]:
# Checking and sorting missing values
df.isnull().sum().sort_values(ascending = False)

contract_cancelled_amount    4998
street3                      4996
refund_amount                4931
street2                      4846
unit_type                    4795
apt_suite                    4595
referred_to                  2814
_311_sr_number               1427
bbl                           815
bin                           815
nta                           697
census_block_2010_            697
complaint_code                673
building_nbr                  643
census_tract_2010_            555
community_board               555
council_district              555
longitude                     509
latitude                      509
borough                       489
city                          191
business_unique_id            145
business_name                 144
state                          51
postcode                       50
street1                        50
address_type                   45
business_category               8
record_id                       0
result        

In [152]:
# Dropping irrelevant columns
# Feature Selection
columns_to_drop = [
    'record_id', '_311_sr_number', 'business_unique_id', 'building_nbr',
    'street1', 'street2', 'street3', 'apt_suite', 'bin', 'bbl', 'latitude', 'longitude',
    'borough', 'community_board', 'council_district', 'nta',
    'census_block_2010_', 'census_tract_2010_', 'address_type', 'state', 'postcode', 'referred_to', 'city'
]

df = df.drop(columns=columns_to_drop, errors='ignore')

# Checking new shape
# we can see that we have 10 columns after we drop the list
df.shape

(5000, 10)

In [153]:
# Checking for nulls after removing irrelevant features
df.isnull().sum().sort_values(ascending = False)

contract_cancelled_amount    4998
refund_amount                4931
unit_type                    4795
complaint_code                673
business_name                 144
business_category               8
intake_channel                  0
intake_date                     0
result_date                     0
result                          0
dtype: int64

In [154]:
# Dropping columns where more than 50% of the values are missing
# (thresh is the minimum number of non-null values a column needs to be kept)
threshold = len(df) * 0.5
df = df.dropna(thresh=threshold, axis=1)

# filling missing values for categorical columns with 'Unknown'
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].fillna('Unknown')

In [155]:
# final check for nulls
# we can see that we do not have missing values anymore 
df.isnull().sum()

intake_date          0
intake_channel       0
business_category    0
complaint_code       0
business_name        0
result_date          0
result               0
dtype: int64
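
To make the threshold rule concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical and are not taken from the complaints data; the point is only that `thresh` counts non-null values, so a column that is exactly half missing is still kept.

```python
import pandas as pd

# Toy frame (hypothetical values): 4 rows, columns with 0, 2, and 3 missing values
toy = pd.DataFrame({
    "full": ["a", "b", "c", "d"],
    "half_missing": ["x", None, "y", None],
    "mostly_missing": [None, None, None, "z"],
})

# Keep a column only if it has at least 2 non-null values out of 4
threshold = int(len(toy) * 0.5)
kept = toy.dropna(thresh=threshold, axis=1)

# Fill the remaining gaps with a placeholder category
kept = kept.fillna("Unknown")

print(kept.columns.tolist())
print(kept["half_missing"].tolist())
```

Here `mostly_missing` is dropped because it has only one non-null value, while `half_missing` survives and has its gaps filled with `'Unknown'`.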

In [156]:
# Feature engineering
# Creating the column resolution_days to see how many days the issue took to be resolved

df['intake_date'] = pd.to_datetime(df['intake_date'], errors = 'coerce')
df['result_date'] = pd.to_datetime(df['result_date'], errors = 'coerce')

df['resolution_days'] = (df['result_date']-df['intake_date']).dt.days
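
As a quick sanity check of the date arithmetic, the sketch below applies the same steps to three made-up rows in the timestamp format returned by the API. The third row contains a deliberately invalid date to show what `errors='coerce'` does; none of these rows come from the real dataset.

```python
import pandas as pd

# Hypothetical rows in the same ISO timestamp format as the API output
toy = pd.DataFrame({
    "intake_date": ["2025-02-23T00:00:00.000", "2025-02-24T00:00:00.000", "not a date"],
    "result_date": ["2025-02-24T00:00:00.000", "2025-02-24T00:00:00.000", "2025-02-24T00:00:00.000"],
})

# Unparseable strings become NaT instead of raising an error
toy["intake_date"] = pd.to_datetime(toy["intake_date"], errors="coerce")
toy["result_date"] = pd.to_datetime(toy["result_date"], errors="coerce")

# Timedelta -> whole days; rows with NaT end up as NaN
toy["resolution_days"] = (toy["result_date"] - toy["intake_date"]).dt.days

print(toy["resolution_days"].tolist())
```

The first row resolves in one day, the second on the same day, and the row with the invalid date yields a missing value, which would need handling before modeling if it occurred in the real data.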

In [157]:
# Simplifying the target into a binary variable to improve model performance:
# 1 if the result indicates a positive vendor response, 0 otherwise
# this avoids overfitting to very small result classes

positive_responses = ['Resolved', 'Reduced', 'Goods', 'Store Credit', 'Cash Amount', 'Took Action', 'Consumer Restitution']

df['vendor_responded'] = df['result'].apply(lambda x: 1 if any(keyword.lower() in x.lower() for keyword in positive_responses) else 0)
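
The keyword rule above is a case-insensitive substring match, which is worth seeing on a few values. In this sketch, `'Referred'` and `'Insufficient Info Received'` appear in the sample rows shown earlier; the other two strings are hypothetical and only illustrate how the rule behaves. The helper name `vendor_responded` is also just an illustrative choice.

```python
import pandas as pd

positive_responses = ['Resolved', 'Reduced', 'Goods', 'Store Credit',
                      'Cash Amount', 'Took Action', 'Consumer Restitution']

def vendor_responded(result: str) -> int:
    """Return 1 if any positive keyword occurs as a substring (case-insensitive)."""
    text = result.lower()
    return int(any(keyword.lower() in text for keyword in positive_responses))

examples = pd.Series([
    "Referred",                    # seen in the sample rows
    "Insufficient Info Received",  # seen in the sample rows
    "Resolved - Cash Amount",      # hypothetical value
    "Unresolved",                  # hypothetical value: contains 'resolved'
])

labels = examples.apply(vendor_responded)
print(labels.tolist())

# Share of each class; useful for spotting imbalance in the real target
print(labels.value_counts(normalize=True))
```

Because the match is on substrings, a value such as `'Unresolved'` would be labeled 1, so it may be worth checking `df['result'].unique()` against the keyword list before relying on the target.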

In [158]:
# Initializing both Label Encoders

le_business_category = LabelEncoder()
le_complaint_code = LabelEncoder()

# Applying Label Encoding to categorical columns needed

df['business_category_encoded'] = le_business_category.fit_transform(df['business_category'].astype(str))
df['complaint_code_encoded'] = le_complaint_code.fit_transform(df['complaint_code'].astype(str))
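
For reference, this is roughly what a fitted `LabelEncoder` holds. The sketch uses a handful of category names from the sample rows; the variable names are illustrative. Each encoder stores its own sorted `classes_`, which is why one fitted encoder per column is needed if the integer codes are to be mapped back to labels later.

```python
from sklearn.preprocessing import LabelEncoder

categories = ["Supermarket", "Restaurant", "Supermarket", "Dry Cleaners"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# Classes are stored in sorted order; each code is an index into classes_
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```

Note that the integer codes imply an ordering that the categories do not really have; tree-based models tend to tolerate this better than linear models.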

In [159]:
# Creating a list of the features we want to use
features = ['business_category_encoded', 'complaint_code_encoded', 'resolution_days']

# Defining the features (X) and the target variable (y)

X = df[features]
y = df['vendor_responded'] 

In [160]:
# Splitting the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify=y) 

# checking the sizes of both the training and testing data
# same number of columns
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

Training set size: (4000, 3)
Testing set size: (1000, 3)
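
The `stratify=y` argument is what keeps the class proportions the same in both splits. The sketch below checks this on synthetic labels (50 positives out of 1,000, a made-up ratio chosen for round numbers); `toy_X` and `toy_y` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 positives out of 1,000
toy_y = np.array([1] * 50 + [0] * 950)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both splits keep the 5% positive rate
print("Train positive rate:", toy_y_train.mean())
print("Test positive rate: ", toy_y_test.mean())
```

Without stratification, a rare class can end up over- or under-represented in the test set by chance, which makes the evaluation metrics less stable.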


In [161]:
# Creating a Random Forest Classifier

rf_model = RandomForestClassifier(random_state = 42) # Initializing
rf_model.fit(X_train, y_train) # Training

y_pred_rf = rf_model.predict(X_test) # Making predictions

# Evaluating the model

print("Random Forest Classifier Results:")
print("----------------------------------")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred_rf))

Random Forest Classifier Results:
----------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       976
           1       0.31      0.17      0.22        24

    accuracy                           0.97      1000
   macro avg       0.64      0.58      0.60      1000
weighted avg       0.96      0.97      0.97      1000

Confusion Matrix:
 [[967   9]
 [ 20   4]]

Accuracy Score: 0.971
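
One way to approach the question of which features matter most is the `feature_importances_` attribute of a fitted random forest. The sketch below is self-contained and runs on synthetic stand-in data from `make_classification`, so the numbers it prints say nothing about the complaints data; in the notebook, the same two final lines could be applied to `rf_model` and `features`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

features = ['business_category_encoded', 'complaint_code_encoded', 'resolution_days']

# Synthetic stand-in data with three columns (NOT the complaints data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=2,
    n_redundant=0, n_repeated=0, random_state=42,
)
X_demo = pd.DataFrame(X_demo, columns=features)

demo_model = RandomForestClassifier(random_state=42)
demo_model.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(demo_model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measurement.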


In [162]:
# Creating a Logistic Regression model

logreg_model = LogisticRegression(max_iter=1000, random_state=42) # Initializing
logreg_model.fit(X_train, y_train) # Training

y_pred_logreg = logreg_model.predict(X_test) # Making predictions

# Evaluating the model

print("\nLogistic Regression Results:")
print("-----------------------------")
print(classification_report(y_test, y_pred_logreg))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_logreg))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred_logreg))



Logistic Regression Results:
-----------------------------
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       976
           1       1.00      0.04      0.08        24

    accuracy                           0.98      1000
   macro avg       0.99      0.52      0.53      1000
weighted avg       0.98      0.98      0.97      1000

Confusion Matrix:
 [[976   0]
 [ 23   1]]

Accuracy Score: 0.977
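
The accuracy figure is easier to interpret next to a trivial baseline. This short calculation uses only the four counts from the logistic regression confusion matrix above and compares the model with a rule that always predicts class 0.

```python
# Counts copied from the logistic regression confusion matrix above
tn, fp = 976, 0
fn, tp = 23, 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of predicted positives, how many were right
recall_pos = tp / (tp + fn)      # of actual positives, how many were found

# Baseline: always predict the majority class (0)
baseline_accuracy = (tn + fp) / total

print(f"accuracy           = {accuracy:.3f}")
print(f"baseline accuracy  = {baseline_accuracy:.3f}")
print(f"precision (class 1) = {precision_pos:.2f}")
print(f"recall (class 1)    = {recall_pos:.2f}")
```

The model's accuracy (0.977) is only marginally above the always-0 baseline (0.976), and it recovers just 1 of the 24 positive cases, so accuracy alone is not a good measure of performance on this target.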


In [164]:
# Creating an XGBoost model

# 'logloss' is the binary log loss; 'mlogloss' is intended for multiclass targets
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42) # Initializing
xgb_model.fit(X_train, y_train) # Training

y_pred_xgb = xgb_model.predict(X_test) # Making predictions

# Evaluating the model

print("\nXGBoost Classifier Results:")
print("----------------------------")
print(classification_report(y_test, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred_xgb))




XGBoost Classifier Results:
----------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       976
           1       0.36      0.33      0.35        24

    accuracy                           0.97      1000
   macro avg       0.67      0.66      0.67      1000
weighted avg       0.97      0.97      0.97      1000

Confusion Matrix:
 [[962  14]
 [ 16   8]]

Accuracy Score: 0.97
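
To compare the three models on the minority class, the sketch below recomputes the class-1 F1 score from the confusion matrices reported above. The helper `minority_f1` is an illustrative name; the counts are copied from the outputs as (TN, FP, FN, TP).

```python
# (TN, FP, FN, TP) copied from the confusion matrices printed above
confusion = {
    "Random Forest": (967, 9, 20, 4),
    "Logistic Regression": (976, 0, 23, 1),
    "XGBoost": (962, 14, 16, 8),
}

def minority_f1(tn, fp, fn, tp):
    """F1 for class 1: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

scores = {name: minority_f1(*cells) for name, cells in confusion.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} F1 (class 1) = {score:.2f}")
print("Best on class 1:", best)
```

All three models have nearly identical accuracy, but on this metric XGBoost (0.35) is ahead of the random forest (0.22) and logistic regression (0.08), consistent with the classification reports.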
