# Uber Trip Classification EDA and Modeling
Oleh : [enricoroselino](https://www.linkedin.com/in/enricoroselino/)

Dataset : [DTS Google Tensorflow 2 Demo day - GGU](https://www.kaggle.com/datasets/mnavas/taxi-routes-for-mexico-city-and-quito/code)

Tujuan :
* Membuat model yang dapat memprediksi kesalahan dalam pencatatan data perjalanan Uber.

Masalah :
* Tidak ada kolom label untuk menjadi target.
* Data tanggal tidak ada notasi AM / PM.

Landasan Teori :
* Batas kecepatan di kawasan perkotaan adalah 50 km/jam, kawasan permukiman adalah 30 km/jam, kawasan jalanan antar kota paling rendah adalah 60 km/jam dan kecepatan tol dalam kota paling cepat adalah 80 km/jam. ([Kumparan](https://kumparan.com/info-otomotif/batas-kecepatan-untuk-dalam-kota-begini-aturannya-1xvq35hLvXP/3), [Otomotif Kompas](https://otomotif.kompas.com/read/2022/06/20/191100215/batas-kecepatan-berkendara-di-jalan-tol-tidak-semua-sama))
* Waktu tunggu maksimal adalah 5 menit. ([therideshareguy](https://therideshareguy.com/uber-extends-wait-time/))


Label Feature Description :
* 0 = Trip not valid
* 1 = Trip valid

Kesimpulan :
* Dapat disimpulkan model **Deep Neural Network** lebih andal dengan **f1 score : 56.37%** pada data test dibandingkan dengan model lainnya. Dengan tingkat false positive yang cukup rendah 0.18 dan false negative 0.42, model dapat diandalkan karena **lebih sedikit kesalahan** memprediksi perjalanan yang **sebenarnya valid dan tidak valid**.

* Sebagai alternatif, model **Random Forest Classifier** memiliki **f1 score : 54.42%** pada data test. Dengan pemodelan yang lebih ringkas namun performanya tidak jauh berbeda dengan Deep Neural Network. Model ini menghasilkan **false positive yang lebih sedikit**, yang berarti akan **minim kesalahan penagihan** dan false negative dapat diverifikasi kembali total tagihan yang harus dibayar oleh customer.

## Import Libs

In [1]:
import os
import math
import re
import pandas as pd
from datetime import date, datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.distance import distance
from sklearn.preprocessing import StandardScaler
from imblearn.combine import SMOTETomek
from fast_ml.model_development import train_valid_test_split

%matplotlib inline
sns.set()

KeyboardInterrupt: 

## Load Data

In [None]:
bog_path = os.path.join("dataset", "bog_clean.csv")
mex_path = os.path.join("dataset", "mex_clean.csv")
uio_path = os.path.join("dataset", "uio_clean.csv")

In [None]:
bog_df = pd.read_csv(bog_path)
mex_df = pd.read_csv(mex_path)
uio_df = pd.read_csv(uio_path)

## Explore Data

In [None]:
print(bog_df.shape)
print(mex_df.shape)
print(uio_df.shape)

In [None]:
# check bog unique value
for i in bog_df[["id", "vendor_id", "store_and_fwd_flag"]] :
    print(f"{i} : {bog_df[i].nunique()}")

In [None]:
# check mex unique value
for i in mex_df[["id", "vendor_id", "store_and_fwd_flag"]] :
    print(f"{i} : {mex_df[i].nunique()}")

In [None]:
# check uio unique value
for i in uio_df[["id", "vendor_id", "store_and_fwd_flag"]] :
    print(f"{i} : {uio_df[i].nunique()}")

Dapat disimpulkan bahwa tidak ada dupilkat dari kolom `id`, `store_and_fwd_flag` hanya memiliki satu nilai unik dan `datetime` masih dalam tipe data objek pada setiap dataset

In [None]:
bog_df["country"] = "colombia"
mex_df["country"] = "mexico"
uio_df["country"] = "equador"

In [None]:
uber_df = pd.concat([bog_df, mex_df, uio_df], ignore_index=True)

In [None]:
uber_df = uber_df.drop(["id", "store_and_fwd_flag"], axis=1)

In [None]:
uber_df.describe().T

Terdapat nilai yang tidak realistis minimal dan maksimal pada `trip_duration`, `dist_meters`, dan `wait_sec	`

<a id="section-one"></a>
### Visualisasi Data

Membuat visualisasi untuk menjelaskan sebaran data yang tidak realistis / bermasalah

In [None]:
plt.figure(figsize=(9, 7))
sns.scatterplot(
    data=uber_df,
    x=uber_df.dist_meters/1000,
    y=uber_df.trip_duration/3600,
    s=100,
    alpha=0.85
)

plt.title("Durasi vs Jarak Tempuh Trip",
    loc="right",
    fontweight="bold",
    size=15
)

plt.xlabel("KM")
plt.ylabel("jam")
plt.show()

Waktu dan jarak tempuh seharusnya tidak ada yang negatif, waktu tempuh paling lama adalah 1.94e+4 jam, serta jarak tempuh paling lama adalah 2.15e+6 KM.

In [None]:
plt.figure(figsize=(9, 7))
sns.violinplot(x=uber_df["wait_sec"]/3600)

plt.title(
    "Distribusi Waktu Tunggu Driver",
    loc="right",
    fontweight="bold",
    size=15
)
plt.xlabel("jam")
plt.show()

Waktu driver menunggu penumpang juga terlampau lama yaitu 2.6e+7 jam

In [None]:
# calculate speed in km/h
speed_kmph = (uber_df["dist_meters"] / 1000) / (uber_df["trip_duration"] / 3600)
speed_kmph.describe()

In [None]:
# plot kmph distribution
plt.figure(figsize=(9, 7))
sns.violinplot(x=speed_kmph)

plt.title(
    "Distribusi Kecepatan Rata-Rata Trip",
    loc="right",
    fontweight="bold",
    size=15
)
plt.xlabel("km / jam")
plt.show()

Kecepatan kendaraan juga terdapat data negatif dan yang paling cepat adalah 5.94e+06

In [None]:
# mengubah format date and time
uber_df["pickup_datetime"] = pd.to_datetime(uber_df["pickup_datetime"], format="%Y/%m/%d %H:%M:%S")
uber_df["dropoff_datetime"] = pd.to_datetime(uber_df["dropoff_datetime"], format="%Y/%m/%d %H:%M:%S")

In [None]:
# check max and minimum time
print("pickup & dropoff maximum time is {} {}".format(
    uber_df["pickup_datetime"].dt.time.max(), 
    uber_df["dropoff_datetime"].dt.time.max())
)

print("pickup & dropoff minimum time is {} {}".format(
    uber_df["pickup_datetime"].dt.time.min(), 
    uber_df["dropoff_datetime"].dt.time.min())
)

Waktu perjalanan saat diterima dan selesai tidak dalam format 24H serta tidak terdapat notasi AM / PM 

## CLEANING, PREPROCESSING, FEATURE ENGINEERING

### *_datetime
Menghilangkan time dari tanggal untuk menghindari kesalahan dalam kalkulasi

In [None]:
uber_df["pickup_date"] = pd.to_datetime(uber_df["pickup_datetime"]).dt.date
uber_df["dropoff_date"] = pd.to_datetime(uber_df["dropoff_datetime"]).dt.date
uber_df = uber_df.drop(["pickup_datetime", "dropoff_datetime"], axis=1)

Menghitung lama perjalanan berdasarkan hari

In [None]:
def day_delta(df) :
    day = []
    for i in range(len(df)) :
        delta = (df.dropoff_date[i] - df.pickup_date[i]).days
        day.append(abs(int(delta)))
    return day

In [None]:
uber_df["day_delta"] = day_delta(uber_df)

In [None]:
uber_df = uber_df.drop(["pickup_date", "dropoff_date"], axis=1)

### est_meters
Membuat estimasi jarak menyetir sesungguhnya, bukan jarak antar 2 titik koordinat

In [None]:
def geodesic(p_lon, p_lat, d_lon, d_lat) :
    # calculate distance using geodesic method
    # COEF is a coeficient for calibrating the geodesic result to nearly matches OSRM driving distance
    # distance in meters
    COEF = 1.5165
    pickup = (p_lat, p_lon)
    dropoff = (d_lat, d_lon)
    result = distance(pickup, dropoff).km
    return result * COEF * 1000

def distance_estimator(df) :
    # calculate the duration then append to est_duration
    # using geodesic
    # name the estimated distance to est_meters
    distance = []
    for i in range(len(df)) :
        PICKUP_LONG = df.pickup_longitude[i]
        PICKUP_LAT = df.pickup_latitude[i]
        DROPOFF_LONG = df.dropoff_longitude[i]
        DROPOFF_LAT = df.dropoff_latitude[i]
        result = geodesic(PICKUP_LONG, PICKUP_LAT, DROPOFF_LONG, DROPOFF_LAT)
        distance.append(math.ceil(result))
    return distance

In [None]:
uber_df["est_meters"] = distance_estimator(uber_df)

### est_duration
Membuat estimasi durasi perjalanan sesungguhnya


In [None]:
def duration_estimator(df) :
    # rata-rata kecepatan 40 km/h
    time = []
    v = 40 * (1000/3600) # average speed in m/s
    for i in range(len(df)) :
        d = df.est_meters[i]
        t = d / v # time travel in seconds
        time.append(math.ceil(t))
    return time

In [None]:
uber_df["est_duration"] = duration_estimator(uber_df)

### avg_kmph
Membuat estimasi rata-rata kecepatan

In [None]:
def avg_kmph(df) :
    speed = []
    for i in range(len(df)) : 
        METERS = df.dist_meters[i]
        DURATION = df.trip_duration[i]
        result = (METERS / 1000) / (DURATION / 3600)
        speed.append(round(abs(result), 4))
    return speed

In [None]:
uber_df["avg_kmph"] = avg_kmph(uber_df)

### diff_meters & diff_duration
Menghitung perbedaan data estimasi dan tercatat

In [None]:
def diff(df) :
    meters = []
    duration = []
    for i in range(len(df)) : 
        EST_METERS = abs(df.est_meters[i])
        RECORDED_METERS = abs(df.dist_meters[i])
        EST_DURATION = abs(df.est_duration[i])
        RECORDED_DURATION = abs(df.trip_duration[i])
        result_meters = RECORDED_METERS - EST_METERS
        result_duration = RECORDED_DURATION - EST_DURATION
        meters.append(abs(result_meters))
        duration.append(abs(result_duration))
    return meters, duration

In [None]:
uber_df["diff_meters"], uber_df["diff_duration"] = diff(uber_df)

### vendor_id
Menetapkan tipe layanan uber

In [None]:
uber_df["vendor_id"].unique()

Terdapat beberapa nama layanan yang tidak tersedia pada laman website uber akan dijadikan taxi dan sisanya akan disesuaikan dengan layanan ekivalennya

In [None]:
def services_extractor(df) :
    # extract services name from vendor_id and map the services based in 2022
    # some normal services is not available in 2022, it'll be taxi service instead
    # uberangel is exclusive to colombia, it'll be uberblack service instead
    # ubersuv will be uberxl
    SERVICE_NAME = re.compile(
        r"taxi|uberxl|uberx|uberblack|ubervan|uberangel|ubersuv"
    )
    df["vendor_id"] = df["vendor_id"].str.lower()
    service = []
    for i in range(len(df)) :
        extract = SERVICE_NAME.search(df.vendor_id[i])
        if extract != None :
            ext_group = extract.group()
            if (ext_group  == "ubervan") or (ext_group == "ubersuv"):
                service.append("uberxl")
            elif ext_group == "uberangel" :
                service.append("uberblack")
            else : 
                service.append(ext_group)
        else :
            service.append("taxi")
    return service

In [None]:
uber_df["service"] = services_extractor(uber_df)

In [None]:
uber_df["service"].unique()

In [None]:
uber_df = uber_df.drop("vendor_id", axis=1)

## DEFINE TARGET VARIABLE / LABELING
Dalam project Uber Trip Classification bertujuan untuk melakukan prediksi terhadap kesalahan aplikasi saat menyimpan data perjalanan yang diakibatkan driver lupa mengakhiri perjalanan pada aplikasi, dan mengakibatkan kesalahan dalam penagihan harga kepada customer.

Variabel `est_meters`, `est_duration` dapat dijadikan pembanding kebenaran terhadap variabel `dist_meters` dan `trip_duration` yang terdapat kesalahan dalam peyimpanan.

Perbandingan menggunakan batas atas dan batas bawah yang terdiri dari toleransi perjalanan lebih lama atau lebih cepat yang diakibatkan oleh kecepatan mobil dan keadaan lalu lintas yang tidak menentu, batas waktu driver menunggu adalah 5 menit, minimal jarak perjalanan yang dianggap valid adalah 1 KM, serta perjalanan yang terhitung 1 hari mungkin valid apabila terjadi sekitar tengah malam, tetapi >= 2 hari sudah pasti tidak valid.

In [None]:
def labeler(df) :
    label = []
    DIST_MIN = 1000
    WT = 5 * 60
    for i in range(len(df)) :
        DLB = df.est_meters[i] * 0.8 # might be closer
        DHB = df.est_meters[i] * 1.5 # might be further
        TLB = df.est_duration[i] * 0.6667 # might be faster (~ 40 km/h - 60 km/h)
        THB = df.est_duration[i] * 4 * 1.5 # might be slower (~ 10 km/h - 40 km/h) and 50% longer
        DD = df.day_delta[i]
        if DD > 1 :
            label.append(0)
        elif (df.est_meters[i] < DIST_MIN) or (df.wait_sec[i] > WT): 
            label.append(0)
        elif (df.dist_meters[i] > DLB) and (df.dist_meters[i] < DHB) :
            if (df.trip_duration[i] > TLB) and (df.trip_duration[i] < THB) :
                label.append(1)
            else :
                label.append(0)
        else :
            label.append(0)
    return label

In [None]:
uber_df["label"] = labeler(uber_df)

In [None]:
for i in range(uber_df["label"].nunique()) :
    print("label {} : {}".format(i, list(uber_df["label"].values).count(i)))

In [None]:
plt.figure(figsize=(9, 7))
sns.countplot(
    x= "label",
    data= uber_df
)

plt.title("Distribusi Label",
    loc="center",
    fontweight="bold",
    size=15
)

plt.show()

### Re-check Labeling Logic Reliability
Cek kembali apakah hasil dari labeling sudah sesuai dengan landasan teori

In [None]:
# load only True data
true_data = uber_df[(uber_df.label == 1)]

In [None]:
# plot dist_meter and trip_duratuion in hour
plt.figure(figsize=(9, 7))
sns.scatterplot(
    data=true_data,
    x=true_data.dist_meters / 1000,
    y=true_data.trip_duration / 3600,
    s=100,
    alpha=0.85
)
plt.title("Durasi vs Jarak Tempuh Trip True",
    loc="right",
    fontweight="bold",
    size=15
)
plt.xlabel("km")
plt.ylabel("jam")
plt.show()

In [None]:
true_data.avg_kmph.describe()

In [None]:
# plot kmph distribution
plt.figure(figsize=(9, 7))
sns.violinplot(x=true_data.avg_kmph)

plt.title(
    "Distribusi Kecepatan Rata-Rata Trip True",
    loc="right",
    fontweight="bold",
    size=15
)
plt.xlabel("km / jam")
plt.show()

In [None]:
# plot wait_sec distribution
plt.figure(figsize=(9, 7))
sns.violinplot(x=true_data["wait_sec"] / 60)

plt.title(
    "Distribusi Waktu Tunggu Driver True",
    loc="right",
    fontweight="bold",
    size=15
)
plt.xlabel("menit")
plt.show()

Data yang valid sudah memenuhi kriteria pada landasan teori seperti kecepatan maksimal adalah 72 km/jam dibawah batas maksimal kecepatan tol dalam kota dan waktu maksimal driver untuk menunggu tidak lebih dari 5 menit

### Missing Value Checking

In [None]:
check_missing = uber_df.isnull().sum() * 100 / uber_df.shape[0]
check_missing[check_missing > 0].sort_values(ascending=False)

Tidak ada data yang hilang / tidak sesuai jadi tidak perlu ditindaklanjuti

### Correlation Check

In [None]:
plt.figure(figsize=(14,7))
sns.heatmap(uber_df.corr(), annot=True, cmap="YlGnBu", mask=np.triu(uber_df.corr()))
plt.show()

Tidak ada feature yang berkorelasi sangat kuat terhadap label, tapi saya menawarkan untuk memilih selain *_longitude, dist_meters, avg_kmph dan diff_meters

In [None]:
model_data = uber_df.drop(["pickup_longitude", "dropoff_longitude", "dist_meters", "avg_kmph", "diff_meters"], axis=1)

## FEATURE SCALING AND TRANSFORMATION

### One Hot Encoding

In [None]:
categorical_cols = model_data.select_dtypes(include='object').columns.tolist()
ohe = pd.get_dummies(model_data[categorical_cols])

In [None]:
ohe.head()

In [None]:
model_data = pd.concat([model_data.drop(categorical_cols, axis=1), ohe], axis=1)

### Data Split

In [None]:
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(model_data, target = 'label', 
                                                                            train_size=0.6, valid_size=0.3, test_size=0.1, random_state=42)

### Standardization

In [None]:
numerical_cols = [col for col in X_train.columns.tolist() if col not in ohe.columns.tolist() + ['label']]

In [None]:
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_valid[numerical_cols] = scaler.fit_transform(X_valid[numerical_cols])
X_test[numerical_cols] = scaler.fit_transform(X_test[numerical_cols])

In [None]:
X_train.head()

### Class Balancing

In [None]:
oversample = SMOTETomek(random_state = 42, n_jobs= -1)
X_train, y_train = oversample.fit_resample(X_train, y_train)

## Machine Learning Model

In [None]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, confusion_matrix, roc_curve, roc_auc_score

### Deep Neural Network

#### Model

In [None]:
# Set memory limiter for each GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=0.01,
    mode = "min",
    patience=3,
    verbose=1,
    baseline=None,
    restore_best_weights=True
)

csv_log = tf.keras.callbacks.CSVLogger(
    os.path.join("model", "history.csv"), 
    separator=",", 
    append=False
)

class mC(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if(logs.get("val_accuracy") >= 95):
            self.model.stop_training = True
limiter = mC()

In [None]:
tf.keras.backend.clear_session()

In [None]:
KR = tf.keras.regularizers.L2(
    l2=0.001
)

model = tf.keras.models.Sequential([
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dense(512, input_shape=(X_train.shape[1], ), activation="LeakyReLU", kernel_regularizer=KR),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation="LeakyReLU", kernel_regularizer=KR),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation="LeakyReLU", kernel_regularizer=KR),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="LeakyReLU", kernel_regularizer=KR),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid", kernel_regularizer=KR)
])

In [None]:
model.compile(
    optimizer=Adam(learning_rate=0.0001),
    loss="binary_crossentropy",
    metrics = ["accuracy"]
)

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data = (X_valid, y_valid),
    batch_size=128,
    validation_batch_size=32,
    epochs=100, 
    verbose = 1,
    callbacks = [earlystop, csv_log, limiter]
)

#### Analyze Deep Neural Network

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))
plt.figure(figsize=(9, 7))
plt.plot(epochs, acc, 'r', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()

plt.figure(figsize=(9, 7))
plt.plot(epochs, loss, 'r', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

In [None]:
val_pred = model.predict(X_valid)
val_result = []
for pred in val_pred :
    if pred > 0.5 :
        val_result.append(1)
    else :
        val_result.append(0)

In [None]:
print(classification_report(list(y_valid), val_result, target_names=["0", "1"]))

In [None]:
f1 = f1_score(list(y_valid), val_result)
print(f1 * 100)

In [None]:
cm = confusion_matrix(list(y_valid), val_result)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("Normalized Deep Neural Network Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

### Random Forest

#### Model

In [None]:
rfc = RandomForestClassifier(max_depth=3, random_state=42)
rfc.fit(X_train, y_train)

In [None]:
arr_feature_importances = rfc.feature_importances_
arr_feature_names = X_train.columns.values
    
df_feature_importance = pd.DataFrame(index=range(len(arr_feature_importances)), columns=['feature', 'importance'])
df_feature_importance['feature'] = arr_feature_names
df_feature_importance['importance'] = arr_feature_importances
df_all_features = df_feature_importance.sort_values(by='importance', ascending=False)
df_all_features

#### Analyze Random Forest (1)

In [None]:
y_pred_proba = rfc.predict_proba(X_valid)[:][:,1]

df_actual_predicted = pd.concat([pd.DataFrame(np.array(y_valid), columns=['y_actual']), pd.DataFrame(y_pred_proba, columns=['y_pred_proba'])], axis=1)
df_actual_predicted.index = y_valid.index

In [None]:
fpr, tpr, tr = roc_curve(df_actual_predicted['y_actual'], df_actual_predicted['y_pred_proba'])
auc = roc_auc_score(df_actual_predicted['y_actual'], df_actual_predicted['y_pred_proba'])

plt.figure(figsize=(9, 7))
plt.plot(fpr, tpr, label='AUC = %0.4f' %auc)
plt.plot(fpr, fpr, linestyle = '--', color='k')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

In [None]:
df_actual_predicted = df_actual_predicted.sort_values('y_pred_proba')
df_actual_predicted = df_actual_predicted.reset_index()

df_actual_predicted['Cumulative N Population'] = df_actual_predicted.index + 1
df_actual_predicted['Cumulative N Bad'] = df_actual_predicted['y_actual'].cumsum()
df_actual_predicted['Cumulative N Good'] = df_actual_predicted['Cumulative N Population'] - df_actual_predicted['Cumulative N Bad']
df_actual_predicted['Cumulative Perc Population'] = df_actual_predicted['Cumulative N Population'] / df_actual_predicted.shape[0]
df_actual_predicted['Cumulative Perc Bad'] = df_actual_predicted['Cumulative N Bad'] / df_actual_predicted['y_actual'].sum()
df_actual_predicted['Cumulative Perc Good'] = df_actual_predicted['Cumulative N Good'] / (df_actual_predicted.shape[0] - df_actual_predicted['y_actual'].sum())

In [None]:
df_actual_predicted.head()

In [None]:
KS = max(df_actual_predicted['Cumulative Perc Good'] - df_actual_predicted['Cumulative Perc Bad'])

plt.figure(figsize=(9, 7))
plt.plot(df_actual_predicted['y_pred_proba'], df_actual_predicted['Cumulative Perc Bad'], color='r')
plt.plot(df_actual_predicted['y_pred_proba'], df_actual_predicted['Cumulative Perc Good'], color='b')
plt.xlabel('Estimated Probability for Being Bad')
plt.ylabel('Cumulative %')
plt.title('Kolmogorov-Smirnov:  %0.4f' %KS)
plt.show()

#### Analyze Random Forest (2)

In [None]:
val_pred = rfc.predict(X_valid)
val_result = []
for pred in val_pred :
    if pred > 0.5 :
        val_result.append(1)
    else :
        val_result.append(0)

In [None]:
print(classification_report(list(y_valid), val_result, target_names=["0", "1"]))

In [None]:
f1 = f1_score(list(y_valid), val_result)
print(f1 * 100)

In [None]:
cm = confusion_matrix(list(y_valid), val_result)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("Normalized Random Forest Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

### KNN

#### Model

In [None]:
def elbow() :
    error_rate = []
    for i in range(1,20):
        knn = KNeighborsClassifier(n_neighbors=i, n_jobs=-1)
        knn.fit(X_train,y_train)
        pred_i = knn.predict(X_valid)
        error_rate.append(np.mean(pred_i != y_valid))
    return error_rate

In [None]:
plt.figure(figsize=(9, 7))
plt.plot(range(1,20), elbow(), color="blue", linestyle="dashed", marker="o",
    markerfacecolor="red", markersize=10)
plt.title("Error Rate vs. K Value")
plt.xlabel("K")
plt.ylabel("Error Rate")
plt.xlim([0,20])
plt.show()

In [None]:
knn = KNeighborsClassifier(n_neighbors=2, n_jobs=-1)
knn.fit(X_train,y_train)

#### Analyze KNN

In [None]:
val_pred = knn.predict(X_valid)
val_result = []
for pred in val_pred :
    if pred > 0.5 :
        val_result.append(1)
    else :
        val_result.append(0)

In [None]:
print(classification_report(list(y_valid), val_result, target_names=["0", "1"]))

In [None]:
f1 = f1_score(list(y_valid), val_result)
print(f1 * 100)

In [None]:
cm = confusion_matrix(list(y_valid), val_result)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("Normalized KNN Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

### Logistic Regression

#### Model

In [None]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

#### Analyze Logistic Regression

In [None]:
val_pred = logreg.predict(X_valid)
val_result = []
for pred in val_pred :
    if pred > 0.5 :
        val_result.append(1)
    else :
        val_result.append(0)

In [None]:
print(classification_report(list(y_valid), val_result, target_names=["0", "1"]))

In [None]:
f1 = f1_score(list(y_valid), val_result)
print(f1 * 100)

In [None]:
cm = confusion_matrix(list(y_valid), val_result)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("Normalized Logistic Regression Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

## TEST DATA TIME!

In [None]:
DNN_test_predict = model.predict(X_test)
DNN_test_preds = []
for pred in DNN_test_predict :
    if pred > 0.5 :
        DNN_test_preds.append(1)
    else :
        DNN_test_preds.append(0)

In [None]:
RFC_test_predict = rfc.predict(X_test)
RFC_test_preds = []
for pred in RFC_test_predict :
    if pred > 0.5 :
        RFC_test_preds.append(1)
    else :
        RFC_test_preds.append(0)

In [None]:
KNN_test_predict = knn.predict(X_test)
KNN_test_preds = []
for pred in KNN_test_predict :
    if pred > 0.5 :
        KNN_test_preds.append(1)
    else :
        KNN_test_preds.append(0)

In [None]:
LOGREG_test_predict = logreg.predict(X_test)
LOGREG_test_preds = []
for pred in LOGREG_test_predict :
    if pred > 0.5 :
        LOGREG_test_preds.append(1)
    else :
        LOGREG_test_preds.append(0)

In [None]:
print(f"f1 score Deep Neural Network : {round(f1_score(list(y_test), DNN_test_preds), 4) * 100}%")
print(f"f1 score Random Forest Classifier : {round(f1_score(list(y_test), RFC_test_preds), 4) * 100}%")
print(f"f1 score Logistic Regression : {round(f1_score(list(y_test), LOGREG_test_preds), 4) * 100}%")
print(f"f1 score K-Nearest Neighbours : {round(f1_score(list(y_test), KNN_test_preds), 4) * 100}%")

In [None]:
cm = confusion_matrix(list(y_test), DNN_test_preds)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("TEST - Normalized Deep Neural Network Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

In [None]:
cm = confusion_matrix(list(y_test), RFC_test_preds)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("TEST - Normalized Random Forest Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

In [None]:
cm = confusion_matrix(list(y_test), LOGREG_test_preds)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("TEST - Normalized Logistic Regression Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

In [None]:
cm = confusion_matrix(list(y_test), KNN_test_preds)
cm_norm = np.round(cm / np.sum(cm, axis=1).reshape(-1,1), 2)
plt.figure(figsize=(9, 7))
sns.heatmap(
    cm_norm, 
    cmap="YlGnBu", 
    annot=True
)
plt.title("TEST - Normalized KNN Confusion Matrix",
    loc="center",
    fontweight="bold",
    size=15
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

## Kesimpulan

* Dapat disimpulkan model **Deep Neural Network** lebih andal dengan **f1 score : 56.37%** pada data test dibandingkan dengan model lainnya. Dengan tingkat false positive yang cukup rendah 0.18 dan false negative 0.42, model dapat diandalkan karena **lebih sedikit kesalahan** memprediksi perjalanan yang **sebenarnya valid dan tidak valid**.

* Sebagai alternatif, model **Random Forest Classifier** memiliki **f1 score : 54.42%** pada data test. Dengan pemodelan yang lebih ringkas namun performanya tidak jauh berbeda dengan Deep Neural Network. Model ini menghasilkan **false positive yang lebih sedikit**, yang berarti akan **minim kesalahan penagihan** dan false negative dapat diverifikasi kembali total tagihan yang harus dibayar oleh customer.

In [None]:
#!jupyter-nbconvert --to PDFviaHTML uber_trip_classification.ipynb