# **Binary Classification for Imbalanced Data**

I have data on 31K historical train records spanning a period of 3 days, and I need to build a classification model on it. The "Delayed" feature is the binary target column: 0 indicates "No Delay" and 1 indicates "Delay".

First, since the dataset is in a nested format, I flattened the "Stops" records into one row per stop and carried "Headcode" over from the parent record as an extra column.


In [122]:
import json

import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder  # needed for the day_of_week encoding below
from sklearn.utils.class_weight import compute_class_weight

Read historical dataset:

In [14]:
with open('./raildataset.json', 'r') as f:
    data = json.load(f)


In [17]:
normalized_df = pd.json_normalize(
    data,
    record_path=['Stops'],
    meta=['Headcode'], errors='ignore'
)
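
To make the flattening step concrete, here is a minimal sketch on made-up records. The sample values and headcodes are hypothetical; only the "Stops" and "Headcode" keys mirror the real dataset.

```python
import pandas as pd

# Hypothetical nested records shaped like the rail data: one train, many stops.
sample = [
    {"Headcode": "1A01", "Stops": [{"Crs": "LBG", "Delayed": 0},
                                   {"Crs": "SAY", "Delayed": 1}]},
    {"Headcode": "2B02", "Stops": [{"Crs": "OTF", "Delayed": 0}]},
]

# record_path expands every stop into its own row;
# meta copies the parent-level field onto each of those rows.
flat = pd.json_normalize(sample, record_path=["Stops"], meta=["Headcode"])
print(flat)
```

Each stop becomes a row, and the train's "Headcode" is repeated on every stop that belongs to it.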


In [18]:
print(normalized_df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31727 entries, 0 to 31726
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Name                         31727 non-null  object
 1   Crs                          31727 non-null  object
 2   Tiploc                       31727 non-null  object
 3   DwellBooked                  31727 non-null  int64 
 4   DwellActual                  31727 non-null  int64 
 5   DepartureDiff                31727 non-null  int64 
 6   ArrivalDiff                  31727 non-null  int64 
 7   DwellDiff                    31727 non-null  int64 
 8   StationNumber                31727 non-null  int64 
 9   UntilNextLocationActualTime  31727 non-null  int64 
 10  UntilNextLocationBookedTime  31727 non-null  int64 
 11  UntilNextLocationTimeDiff    31727 non-null  int64 
 12  Delayed                      31727 non-null  int64 
 13  BookedDeparture.$date        31

Read the coordinates dataset and merge it with the historical data:

In [82]:
with open('./station_coordinates5.json', 'r') as f:
    coordinates = json.load(f)

In [83]:
dfcoord = pd.json_normalize(
    coordinates,
    record_path=['stations'],
    errors='ignore'
)
dfcoord.head()

Unnamed: 0,name,X,Y,crs
0,Hildenborough,0.229,51.215,HLB
1,Swale,0.746,51.389,SWL
2,Kemsley,0.735,51.361,KML
3,Sheerness-on-Sea,0.758,51.441,SSS
4,Queenborough,0.75,51.416,QBR


In [85]:
dfcoord.rename(columns={"crs": "CRS"}, inplace=True)
normalized_df.rename(columns={"Crs": "CRS"}, inplace=True)

In [88]:
normalized_df = pd.merge(normalized_df, dfcoord, on='CRS', how='left')
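
A left merge keeps every stop and attaches coordinates where the CRS code matches. The sketch below uses hypothetical toy frames to show one thing worth watching for: if the merge cell is run a second time, pandas adds "_x" and "_y" suffixes to the overlapping columns. The "X_y" and "Y_y" names used later in this notebook are consistent with that, although this is an inference from the column names rather than something the notebook states.

```python
import pandas as pd

# Hypothetical toy frames standing in for the stops and the station coordinates.
stops = pd.DataFrame({"CRS": ["LBG", "SAY"], "Delayed": [0, 1]})
coords = pd.DataFrame({"CRS": ["LBG", "SAY"],
                       "X": [-0.086, 0.168],
                       "Y": [51.503, 51.393]})

# validate="many_to_one" raises if a CRS code appears twice in coords,
# which would otherwise silently duplicate stop rows.
once = pd.merge(stops, coords, on="CRS", how="left", validate="many_to_one")

# Merging the already-merged frame again creates suffixed duplicates.
twice = pd.merge(once, coords, on="CRS", how="left", validate="many_to_one")

print(list(once.columns))
print(list(twice.columns))
```

Checking the column list after the merge is a cheap way to confirm which coordinate columns actually exist before selecting features.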


Eliminated null values from the dataset:


In [106]:
normalized_df = normalized_df.dropna()


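Eliminated rows whose actual departure or arrival holds the placeholder date '0001-01-03T00:00:00.000Z', which acts as a null date value: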

In [108]:
normalized_df = normalized_df[
    (normalized_df['ActualDeparture.$date'] != '0001-01-03T00:00:00.000Z') &
    (normalized_df['ActualArrival.$date'] != '0001-01-03T00:00:00.000Z')
]

Converted the "BookedArrival.$date" column to datetime, since it is currently stored as object:

In [109]:
normalized_df['BookedArrival.$date'] = pd.to_datetime(normalized_df['BookedArrival.$date'])

Created sin_time and cos_time features because time of day is cyclical (to explore more about cyclical features, check my work "Encoding cyclical features"):

In [110]:
normalized_df['hour'] = normalized_df['BookedArrival.$date'].dt.hour
normalized_df['minute'] = normalized_df['BookedArrival.$date'].dt.minute
normalized_df['second'] = normalized_df['BookedArrival.$date'].dt.second
normalized_df['hourfloat'] = normalized_df.hour + normalized_df.minute / 60.0 + normalized_df.second / 3600.0

normalized_df['sin_time'] = np.sin(2. * np.pi * normalized_df.hourfloat / 24.)
normalized_df['cos_time'] = np.cos(2. * np.pi * normalized_df.hourfloat / 24.)
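
The point of the sine/cosine pair is that times just before and just after midnight end up close together, which the raw hour value does not capture. A minimal sketch with illustrative times (the helper name `encode_hour` is made up for this example):

```python
import numpy as np

def encode_hour(hourfloat):
    """Map an hour of day in [0, 24) to a point on the unit circle."""
    angle = 2.0 * np.pi * hourfloat / 24.0
    return np.array([np.sin(angle), np.cos(angle)])

late = encode_hour(23.5)   # 23:30
early = encode_hour(0.5)   # 00:30
noon = encode_hour(12.0)   # 12:00

# On the raw scale 23:30 and 00:30 look 23 hours apart.
raw_gap = abs(23.5 - 0.5)

# On the circle they are neighbours, while noon is almost opposite.
enc_gap = np.linalg.norm(late - early)
noon_gap = np.linalg.norm(late - noon)

print(raw_gap, enc_gap, noon_gap)
```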

Created day_of_week and encoded it to use as an input value in classification:

In [111]:
normalized_df['day_of_week'] = normalized_df['BookedArrival.$date'].dt.day_name()

lblEncoder_day = LabelEncoder()
normalized_df['day_of_week'] = lblEncoder_day.fit_transform(normalized_df['day_of_week'])
print(lblEncoder_day.classes_)
print(lblEncoder_day.transform(lblEncoder_day.classes_))

['Monday' 'Thursday' 'Tuesday' 'Wednesday']
[0 1 2 3]
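
Note that LabelEncoder assigns codes in alphabetical order, so Thursday gets 1 and Tuesday gets 2. If calendar order is preferred, pandas exposes it directly through `dt.dayofweek`. The sketch below compares the two on hypothetical dates; it is an alternative to consider, not what this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dates: 2024-01-01 is a Monday, followed by Tue, Wed, Thu.
dates = pd.to_datetime(pd.Series(["2024-01-01", "2024-01-02",
                                  "2024-01-03", "2024-01-04"]))

# Alphabetical codes, as produced by LabelEncoder on the day names.
alphabetical = LabelEncoder().fit_transform(dates.dt.day_name())

# Calendar codes: Monday=0 ... Sunday=6.
calendar = dates.dt.dayofweek

print(list(alphabetical))
print(list(calendar))
```

Tree-based models split on thresholds, so the ordering of the codes can influence which days get grouped together in a split.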


Check class weights:

In [112]:
target_counts = normalized_df['Delayed'].value_counts()

print(target_counts)

0    16550
1    10872
Name: Delayed, dtype: int64


I used the "compute_class_weight" function from scikit-learn to calculate class weights. With the 'balanced' setting, weights are inversely proportional to class frequency: a lower weight means the class is more prevalent, while a higher weight means it occurs less often.

In [113]:
normalized_df['Delayed'] = normalized_df['Delayed'].astype(int)

class_weights = compute_class_weight('balanced', classes=normalized_df['Delayed'].unique(), y=normalized_df['Delayed'])

class_weights_dict = dict(zip(normalized_df['Delayed'].unique(), class_weights))

print("Class Weights:")
print(class_weights_dict)

Class Weights:
{1: 1.2611295069904342, 0: 0.8284592145015106}
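
The 'balanced' weights follow the formula n_samples / (n_classes * class_count), so they can be reproduced by hand from the class counts printed earlier:

```python
# Class counts taken from the value_counts() output above.
counts = {0: 16550, 1: 10872}

n_samples = sum(counts.values())   # 27422 rows in total
n_classes = len(counts)            # 2 classes

# 'balanced' weight: total rows divided by (number of classes * rows in that class).
weights = {label: n_samples / (n_classes * count)
           for label, count in counts.items()}
print(weights)
```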


As the output shows, the target is imbalanced: class 1 (Delay) is the minority with about 40% of the rows, while class 0 (No Delay) is the majority with about 60%. If these weights were passed to a model, misclassifying a class 1 sample would be penalized more heavily than misclassifying a class 0 sample. The goal is to balance the impact of the minority class (class 1) against the majority class (class 0), allowing the model to learn to distinguish both classes effectively.

To address this imbalance, I applied random over-sampling, which duplicates randomly chosen samples of the minority class until both classes have the same number of instances.

In [114]:
pip install imbalanced-learn




In [115]:
normalized_df.head()

Unnamed: 0,Name,CRS,Tiploc,DwellBooked,DwellActual,DepartureDiff,ArrivalDiff,DwellDiff,StationNumber,UntilNextLocationActualTime,...,second,hourfloat,sin_time,cos_time,name_x,X_x,Y_x,name_y,X_y,Y_y
0,London Bridge,LBG,LNDNBDE,60,44,44,60,-16,2,1306,...,0,23.633333,-0.095846,0.995396,London Bridge,-0.086,51.503,London Bridge,-0.086,51.503
1,Swanley,SAY,SWLY,30,39,39,30,9,3,468,...,0,0.016667,0.004363,0.99999,Swanley,0.168,51.393,Swanley,0.168,51.393
2,Otford,OTF,OTFORD,30,37,94,87,7,4,418,...,30,0.141667,0.03708,0.999312,Otford,0.197,51.313,Otford,0.197,51.313
3,London Bridge,LBG,LNDNBDE,60,44,45,61,-16,2,1305,...,0,23.633333,-0.095846,0.995396,London Bridge,-0.086,51.503,London Bridge,-0.086,51.503
4,Swanley,SAY,SWLY,30,39,39,30,9,3,469,...,0,0.016667,0.004363,0.99999,Swanley,0.168,51.393,Swanley,0.168,51.393


In [116]:
from imblearn.over_sampling import RandomOverSampler

# Feature columns used as model inputs
cols = ['DwellBooked', 'DwellActual', 'StationNumber', 'sin_time', 'cos_time', 'day_of_week', 'X_y', 'Y_y']
X = normalized_df[cols]

y = normalized_df['Delayed']
y = y.to_frame()

# Convert the 'Delayed' column to integers if it's not already
y['Delayed'] = y['Delayed'].astype(int)

# Apply oversampling
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)

# Make sure the oversampled data are DataFrames with the expected columns
X_resampled = pd.DataFrame(X_resampled, columns=cols)
y_resampled = pd.DataFrame(y_resampled, columns=['Delayed'])

# Check the class distribution after oversampling
print("Class Distribution after Oversampling:")
print(y_resampled['Delayed'].value_counts())


Class Distribution after Oversampling:
1    16550
0    16550
Name: Delayed, dtype: int64


Oversampling produced a larger dataset with 16,550 samples in each class.

Split the dataset into training and test sets with a ratio of 0.80 to 0.20:

In [117]:
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X_resampled, y_resampled, test_size=0.2, stratify=y_resampled, random_state=42
)
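
Because the split above is made after oversampling, copies of the same minority row can land in both the training and the test set, which tends to make test scores look better than they would on unseen data. A common alternative is to split first and oversample only the training portion. The sketch below shows that order on hypothetical toy data, with `sklearn.utils.resample` standing in for RandomOverSampler; it is not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 60 majority rows (0) and 40 minority rows (1).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"feature": rng.normal(size=100),
                    "Delayed": [0] * 60 + [1] * 40})

# 1) Hold out the test set first, so it contains only original rows.
train, test = train_test_split(toy, test_size=0.2,
                               stratify=toy["Delayed"], random_state=42)

# 2) Oversample the minority class inside the training split only.
majority = train[train["Delayed"] == 0]
minority = train[train["Delayed"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Delayed"].value_counts())
print(test["Delayed"].value_counts())
```

With this order the test set keeps the original class ratio, so the reported metrics reflect the distribution the model would see in practice.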



Applied a Random Forest classifier, which I chose because tree ensembles tend to cope reasonably well with imbalanced data:

In [118]:
# Evaluate feature importance.
clf = RandomForestClassifier(n_estimators=200)
# ravel() turns the one-column target DataFrame into the 1d array scikit-learn expects
clf = clf.fit(Xtrain, ytrain.values.ravel())
indices = np.argsort(clf.feature_importances_)[::-1]

print('Feature ranking:')

for f in range(Xtrain.shape[1]):
    print('%d. feature %d %s (%f)' % (f + 1, indices[f], Xtrain.columns[indices[f]],
                                      clf.feature_importances_[indices[f]]))



Feature ranking:
1. feature 3 sin_time (0.194151)
2. feature 1 DwellActual (0.184447)
3. feature 4 cos_time (0.176006)
4. feature 7 Y_y (0.113438)
5. feature 2 StationNumber (0.106189)
6. feature 6 X_y (0.105908)
7. feature 5 day_of_week (0.079373)
8. feature 0 DwellBooked (0.040488)


Some features have low importance; however, I did not eliminate them because of the class imbalance, as features that matter little overall could still help predict class 1.



In [119]:
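clf = RandomForestClassifier(n_estimators=150, n_jobs=-1, criterion='gini', max_features='sqrt',
                             min_samples_split=7, min_weight_fraction_leaf=0.0,
                             max_leaf_nodes=40, max_depth=10)

# Wrap the forest so its predicted probabilities are calibrated with 5-fold sigmoid scaling
calibrated_clf = CalibratedClassifierCV(clf, method='sigmoid', cv=5)
# ravel() passes the target as a 1d array and avoids the column-vector warning
calibrated_clf.fit(Xtrain, ytrain.values.ravel())
y_val = calibrated_clf.predict_proba(Xtest)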



In [120]:
# Pick the most probable class per row (column index 0 or 1 matches the class label)
y_pred = y_val.argmax(axis=1)
print(classification_report(ytest, y_pred, target_names=['0', '1'], digits=4))


              precision    recall  f1-score   support

           0     0.7217    0.6846    0.7026      3310
           1     0.7000    0.7360    0.7175      3310

    accuracy                         0.7103      6620
   macro avg     0.7108    0.7103    0.7101      6620
weighted avg     0.7108    0.7103    0.7101      6620



Now we can see that the model performs similarly on both classes (F1 of 0.70 for class 0 and 0.72 for class 1), and the overall accuracy is 71%.
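
The figures in the classification report can be cross-checked by hand. The counts below are reconstructed from the rounded recall values and the support of 3310 per class, so they are approximate and should not be read as the notebook's exact confusion matrix.

```python
# Approximate confusion-matrix counts, reconstructed from the rounded report.
tn, fp = 2266, 1044   # true class 0: predicted 0 / predicted 1
fn, tp = 874, 2436    # true class 1: predicted 0 / predicted 1

# Metrics for class 1 (Delay).
precision_1 = tp / (tp + fp)   # of all predicted delays, how many were real
recall_1 = tp / (tp + fn)      # of all real delays, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Overall accuracy across both classes.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 4), round(recall_1, 4), round(f1_1, 4), round(accuracy, 4))
```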