# DATA DEFINITION

**Columns in the dataset**

---
<BR><BR>1. trans_date_trans_time:<BR>Transaction timestamp. Can be used to detect time patterns (e.g. transactions that often occur at unusual hours may be suspected as fraud).
<BR><BR>2. cc_num:<BR>Credit card number. Identifies the card user, but does not directly help detect fraud unless the card shows an unusual usage pattern.
<BR><BR>3. merchant:<BR>Merchant (seller) name. Certain merchants may be involved in fraudulent transactions more often, so this can help detect merchant-related risk.
<BR><BR>4. category:<BR>Transaction category. Fraud may occur more often in certain categories, such as luxury goods or entertainment, which have higher transaction values.
<BR><BR>5. amt:<BR>Transaction amount. A very large amount, or one that does not match the usual spending pattern, can be a sign of fraud.
<BR><BR>6. first:<BR>Cardholder's first name. Not very relevant for detecting fraud directly.
<BR><BR>7. last:<BR>Cardholder's last name. Like first, not directly relevant.
<BR><BR>8. gender:<BR>Cardholder's gender. Can be used to look at demographic patterns related to fraud, although it does not directly indicate fraud.
<BR><BR>9. street:<BR>Cardholder's street address. Can be used for anomaly detection when the transaction location is far from the cardholder's address.
<BR><BR>10. city:<BR>Cardholder's city. Like street, can be used to check for a mismatch between the cardholder's location and the transaction.
<BR><BR>11. state:<BR>Cardholder's state. Like city, can help detect location anomalies.
<BR><BR>12. zip:<BR>Cardholder's ZIP code. Like city and state, can help detect geographic anomalies.
<BR><BR>13. lat:<BR>Latitude of the cardholder's location. Can help detect a mismatch when compared with the transaction location.
<BR><BR>14. long:<BR>Longitude of the cardholder's location. Used together with lat to locate the cardholder.
<BR><BR>15. city_pop:<BR>Population of the cardholder's city. Can be used to understand area-related risk, e.g. densely populated areas may have more transactions and more risk.
<BR><BR>16. job:<BR>Cardholder's occupation. High-income occupations may be more exposed to fraud because they are more often involved in large transactions.
<BR><BR>17. dob:<BR>Cardholder's date of birth. Age can be a factor, e.g. certain age groups may be more vulnerable to fraud.
<BR><BR>18. trans_num:<BR>Unique transaction ID. Not directly relevant for fraud detection.
<BR><BR>19. unix_time:<BR>Transaction time in Unix format. Like trans_date_trans_time, helps in analysing time patterns.
<BR><BR>20. merch_lat:<BR>Latitude of the merchant. Can be used to detect a mismatch between the merchant's and the cardholder's locations.
<BR><BR>21. merch_long:<BR>Longitude of the merchant. Like merch_lat, helps detect location anomalies.
<BR><BR>22. is_fraud:<BR>Label indicating whether the transaction is fraud (0 = not fraud, 1 = fraud). This is the target to predict.


# IMPORTS

In [None]:
#general imports
import pandas as pd
import numpy as np
from datetime import datetime

#for Splitting
from sklearn.model_selection import train_test_split

#for handling imbalance data
from imblearn.over_sampling import SMOTE

#for modelling
from sklearn.ensemble import AdaBoostClassifier
    # from sklearn.tree import DecisionTreeClassifier  # replaced with a random forest instead
from sklearn.ensemble import RandomForestClassifier

#for model evaluation
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

#for visualizations
import matplotlib.pyplot as plt
import seaborn as sns


# DATA LOADING

In [None]:

df_train = pd.read_csv('./archive/fraudTrain.csv')

# NOTE: the lines below should stay disabled for the real run.
# While the code is still being iterated on and re-run, 1000 rows are enough to keep things fast;
# switch to the full data once the actual modelling starts.
# df_train.sample(1000).to_csv('df_train_sampling.csv', index=False)  # dump a small sample to CSV for lightweight analysis
# df_train=df_train.sample(1000)


In [None]:
## helpers to save and load a pickle

import pickle

def save_model_to_pickle(model, filename):
    with open(filename, 'wb') as file:
        pickle.dump(model, file)
    print(f"Model saved to {filename}")

def load_model_from_pickle(filename):
    with open(filename, 'rb') as file:
        model = pickle.load(file)
    print(f"Model loaded from {filename}")
    return model




# EDA + Data Preparation

In [None]:
# miniEDAtoPDF, adapted from the internet


import pandas as pd
import matplotlib.pyplot as plt
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

def mini_eda(df, pdf_filename):
    # Create a PDF file
    pdf = SimpleDocTemplate(pdf_filename, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Function to format numbers
    def format_number(num):
        return f"{num:,.2f}"

    # Iterate through each column in the dataframe
    for col in df.columns:
        # Count of non-null rows
        non_null_count = df[col].count()
        story.append(Paragraph(f"<b>Column: {col}</b>", styles['Heading2']))
        story.append(Paragraph(f"Non-Null Rows: {format_number(non_null_count)}", styles['BodyText']))

        # Missing values
        missing_count = df[col].isnull().sum()
        missing_pct = (missing_count / len(df)) * 100
        story.append(Paragraph(f"Missing Values: {format_number(missing_count)} ({format_number(missing_pct)}%)", styles['BodyText']))

        # Distinct values
        distinct_count = df[col].nunique()
        distinct_pct = (distinct_count / len(df)) * 100
        story.append(Paragraph(f"Distinct Values: {format_number(distinct_count)} ({format_number(distinct_pct)}%)", styles['BodyText']))

        # Check if the column is numerical or categorical
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical column
            min_val = df[col].min()
            max_val = df[col].max()
            mean_val = df[col].mean()
            q5 = df[col].quantile(0.05)
            q25 = df[col].quantile(0.25)
            q50 = df[col].quantile(0.50)
            q75 = df[col].quantile(0.75)
            q95 = df[col].quantile(0.95)
            story.append(Paragraph(f"Min: {format_number(min_val)}", styles['BodyText']))
            story.append(Paragraph(f"Max: {format_number(max_val)}", styles['BodyText']))
            story.append(Paragraph(f"Mean: {format_number(mean_val)}", styles['BodyText']))
            story.append(Paragraph(f"5th Percentile (Q5): {format_number(q5)}", styles['BodyText']))
            story.append(Paragraph(f"25th Percentile (Q25): {format_number(q25)}", styles['BodyText']))
            story.append(Paragraph(f"50th Percentile (Q50): {format_number(q50)}", styles['BodyText']))
            story.append(Paragraph(f"75th Percentile (Q75): {format_number(q75)}", styles['BodyText']))
            story.append(Paragraph(f"95th Percentile (Q95): {format_number(q95)}", styles['BodyText']))

            # Plot the distribution only when the column has at most 20 distinct values
            if distinct_count <= 20:
                # Plot distribution
                plt.figure(figsize=(10, 6))
                df[col].value_counts().sort_index().plot(kind='bar')
                plt.title(f"Distribution of {col}")
                plt.xlabel(col)
                plt.ylabel("Count")
                img_filename = f"{col}_distribution.png"
                plt.savefig(img_filename)
                plt.close()
                story.append(Spacer(1, 12))
                story.append(Image(img_filename, width=400, height=300))
            # (Min and Max are already reported above, so nothing more is printed otherwise)
        else:
            # Categorical column
            story.append(Paragraph(f"Distinct Values: {format_number(distinct_count)}", styles['BodyText']))

            # Top 10 + others
            top_10 = df[col].value_counts().nlargest(10)
            others_count = len(df) - top_10.sum()
            others_pct = (others_count / len(df)) * 100
            story.append(Paragraph("Top 10 + Others:", styles['BodyText']))
            for value, count in top_10.items():
                pct = (count / len(df)) * 100
                story.append(Paragraph(f"{value}: {format_number(count)} ({format_number(pct)}%)", styles['BodyText']))
            story.append(Paragraph(f"Others: {format_number(others_count)} ({format_number(others_pct)}%)", styles['BodyText']))

            # Plot distribution
            plt.figure(figsize=(10, 6))
            top_10.plot(kind='bar')
            plt.title(f"Top 10 Distribution of {col}")
            plt.xlabel(col)
            plt.ylabel("Count")
            img_filename = f"{col}_top10_distribution.png"
            plt.savefig(img_filename)
            plt.close()
            story.append(Spacer(1, 12))
            story.append(Image(img_filename, width=400, height=300))

        # Add a page break after each column
        story.append(PageBreak())

    # Build the PDF
    pdf.build(story)

# Example usage
# df = pd.read_csv('your_data.csv')
# mini_eda(df, 'eda_report.pdf')




In [None]:
mini_eda(df_train, 'eda_report__df_train.pdf')

In [None]:
df_train.isnull().sum()

In [None]:
df_train.info()

In [None]:
# Jupyter process stopper: uncomment to break the running process of "Run All"

## convert date and time

In [None]:
# Convert the 'trans_date_trans_time' column to datetime
df_train['trans_date_trans_time'] = pd.to_datetime(df_train['trans_date_trans_time'])

# Extract the time components from 'trans_date_trans_time'
df_train['trans_year'] = df_train['trans_date_trans_time'].dt.year
df_train['trans_month'] = df_train['trans_date_trans_time'].dt.month
df_train['trans_date'] = df_train['trans_date_trans_time'].dt.day
df_train['trans_hour'] = df_train['trans_date_trans_time'].dt.hour
df_train['trans_dow'] = df_train['trans_date_trans_time'].dt.dayofweek  # Day of the week (0 = Monday, 6 = Sunday)

# Show a random sample of rows to check the result
df_train.sample(15)[['trans_date_trans_time', 'trans_year', 'trans_month', 'trans_date', 'trans_hour', 'trans_dow']]


In [None]:
# Convert the 'unix_time' column to datetime
df_train['unix_time'] = pd.to_datetime(df_train['unix_time'], unit='s')

# Add 7 years to 'unix_time' (approximated as 7 * 365 days, so leap days are ignored)
seconds_in_7_years = 7 * 365 * 24 * 60 * 60  # number of seconds in 7 non-leap years
df_train['unix_time'] = df_train['unix_time'] + pd.Timedelta(seconds=seconds_in_7_years)

# Extract the time components from 'unix_time'
# (the 7-year shift is already applied above, so the year is not shifted a second time)
df_train['unix_year'] = df_train['unix_time'].dt.year
df_train['unix_month'] = df_train['unix_time'].dt.month
df_train['unix_date'] = df_train['unix_time'].dt.day
df_train['unix_hour'] = df_train['unix_time'].dt.hour
df_train['unix_dow'] = df_train['unix_time'].dt.dayofweek  # Day of the week (0 = Monday, 6 = Sunday)

# Show a random sample of rows to check the result
df_train.sample(10)[['unix_time', 'unix_year', 'unix_month', 'unix_date', 'unix_hour', 'unix_dow']]
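
The fixed `7 * 365` day shift above does not account for leap days. A minimal sketch on made-up timestamps (the `ts` series is hypothetical, not taken from the dataset) compares it with a calendar-aware `pd.DateOffset(years=7)`, which could be considered as an alternative:

```python
import pandas as pd

# Hypothetical timestamps, for illustration only
ts = pd.Series(pd.to_datetime(['2012-01-01 00:00:18', '2013-06-21 12:14:25']))

# Fixed-length shift: 7 * 365 days, leap days are not accounted for
shift_fixed = ts + pd.Timedelta(days=7 * 365)

# Calendar-aware shift: same month, day and time, 7 years later
shift_calendar = ts + pd.DateOffset(years=7)

comparison = pd.DataFrame({
    'original': ts,
    'fixed_7x365_days': shift_fixed,
    'calendar_7_years': shift_calendar,
})
print(comparison)
```

Because 7 * 365 days is a whole number of weeks, the fixed shift keeps the original day of the week, while the calendar shift keeps the original calendar date.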


In [None]:
# Compare trans time and unix time
df_train.sample(15)[['trans_date_trans_time', 'trans_year', 'trans_month', 'trans_date', 'trans_hour', 'trans_dow','unix_time', 'unix_year', 'unix_month', 'unix_date', 'unix_hour', 'unix_dow']]

## compute the day difference between transaction and booking

In [None]:
# Create the 'dow_dif' column: day-of-week difference between the transaction and its booking,
# wrapped into the range 0..6 (a negative difference gets 7 added, which is what % 7 does)
df_train['dow_dif'] = (df_train['trans_dow'] - df_train['unix_dow']) % 7
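
As a quick check of the wrap-around logic, the sketch below compares the explicit "add 7 when negative" rule with a modulo-7 difference over all 49 possible day pairs. Both helper functions are hypothetical names introduced only for this illustration:

```python
import itertools

def dow_diff_explicit(trans_dow, unix_dow):
    # Explicit rule: keep non-negative differences, add 7 to negative ones
    diff = trans_dow - unix_dow
    return diff if diff >= 0 else diff + 7

def dow_diff_modulo(trans_dow, unix_dow):
    # Python's % always returns a non-negative result for a positive modulus
    return (trans_dow - unix_dow) % 7

pairs = list(itertools.product(range(7), repeat=2))
mismatches = [(t, u) for t, u in pairs if dow_diff_explicit(t, u) != dow_diff_modulo(t, u)]
print(f"checked {len(pairs)} pairs, mismatches: {len(mismatches)}")
```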


In [None]:
# Show the result
df_train.sample(30)[['unix_dow', 'trans_dow', 'dow_dif']].head()

## compute distance

In [None]:
# Function to compute distance using the Haversine formula (generated with ChatGPT)

def haversine(lat1, lon1, lat2, lon2):
    # Convert degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    
    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    
    # Earth's radius: 6371 for kilometers (use 3958.8 instead for miles)
    R = 6371  # Earth's radius in kilometers
    distance = R * c
    return distance
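
A small sanity check of the formula, as a sketch on made-up coordinates. `haversine_km` is a hypothetical, self-contained restatement of the function above, so that the snippet runs on its own:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Identical points should be 0 km apart
same_point = haversine_km(40.0, -75.0, 40.0, -75.0)

# One degree of latitude is 2 * pi * R / 360, roughly 111.19 km for R = 6371
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)

# Array inputs are processed element-wise in a single call
lats = np.array([0.0, 10.0])
dists = haversine_km(lats, np.zeros(2), lats + 1.0, np.zeros(2))
print(same_point, one_degree_lat, dists)
```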

In [None]:
# Create the 'card_merchant_distance_km' column by computing the distance.
# haversine() only uses NumPy functions, so it can take whole columns at once
# instead of being applied row by row.
df_train['card_merchant_distance_km'] = haversine(
    df_train['lat'], df_train['long'], df_train['merch_lat'], df_train['merch_long']
)

# Group the distance into 10 km bins, used later to compute the fraud rate per group
df_train['card_merchant_distance_km_grp']=(df_train['card_merchant_distance_km']//10)*10

# Show the result
df_train.sample(10)[['lat', 'long', 'merch_lat', 'merch_long', 'card_merchant_distance_km','card_merchant_distance_km_grp']]

## compute age at transaction time

In [None]:
# Convert 'dob' to datetime so it can be processed
df_train['dob'] = pd.to_datetime(df_train['dob'])

# Compute the age as the difference between 'unix_year' and the year of 'dob'
df_train['person_age'] = df_train['unix_year'] - df_train['dob'].dt.year

# Group the age into 10-year bins, used later to compute the fraud rate per group
df_train['person_age_grp']=(df_train['person_age']//10)*10

# Show the result
df_train.sample(10)[['dob', 'unix_year', 'person_age','person_age_grp']]

## fraud rate by unix_hour

In [None]:
# Group by hour, then count the transactions (number of rows, i.e. the size)
# and the fraudulent transactions (is_fraud == 1, so a plain sum is enough)
hourly_data = df_train.groupby('unix_hour').agg(
    total_transactions=('is_fraud', 'size'),
    fraud_cases=('is_fraud', 'sum')
).reset_index()

# compute the rate
hourly_data['fraud_rate_by_unix_hour'] = hourly_data['fraud_cases'] / hourly_data['total_transactions']

# merge it back into the main dataframe
df_train = df_train.merge(hourly_data[['unix_hour', 'fraud_rate_by_unix_hour']], on='unix_hour', how='left')

# look at 10 random rows of the new df_train
df_train.sample(10)
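
The same group, rate, merge pattern is repeated for several columns in this notebook. A minimal sketch of how it could be wrapped in one helper is shown below; `add_fraud_rate` and the toy data are hypothetical and not part of the original code:

```python
import pandas as pd

def add_fraud_rate(df, group_col, rate_col, target_col='is_fraud'):
    """Return (df with rate_col merged in, per-group summary table)."""
    summary = df.groupby(group_col).agg(
        total_transactions=(target_col, 'size'),  # rows per group
        fraud_cases=(target_col, 'sum'),          # rows with target == 1
    ).reset_index()
    summary[rate_col] = summary['fraud_cases'] / summary['total_transactions']
    merged = df.merge(summary[[group_col, rate_col]], on=group_col, how='left')
    return merged, summary

# Made-up data: 4 transactions at hour 0 (1 fraud), 2 at hour 23 (both fraud)
toy = pd.DataFrame({
    'unix_hour': [0, 0, 0, 0, 23, 23],
    'is_fraud':  [0, 0, 0, 1, 1, 1],
})
toy, hourly = add_fraud_rate(toy, 'unix_hour', 'fraud_rate_by_unix_hour')
print(hourly)
```

With such a helper, each later section would reduce to a single call with a different `group_col` and `rate_col`.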

In [None]:
import matplotlib.pyplot as plt

# Plot the fraud rate for each hour
plt.figure(figsize=(6, 3))
plt.bar(hourly_data['unix_hour'], hourly_data['fraud_rate_by_unix_hour'], color='skyblue', edgecolor='black')
plt.title('Fraud Rate Distribution by Hour of the Day (is_fraud = 1)')
plt.xlabel('Hour of the Day (unix_hour)')
plt.ylabel('Fraud Rate')
plt.xticks(range(0, 24))  # Display hours from 0 to 23
plt.grid(True, axis='y')
plt.show()


## fraud rate by category


In [None]:
# same approach as in "fraud rate by unix_hour";
# more detailed comments can be found there


category_data = df_train.groupby('category').agg(
    total_transactions=('is_fraud', 'size'),
    fraud_cases=('is_fraud', 'sum')
).reset_index()

category_data['fraud_rate_by_category'] = category_data['fraud_cases'] / category_data['total_transactions']

df_train = df_train.merge(category_data[['category', 'fraud_rate_by_category']], on='category', how='left')

df_train.sample(10)

In [None]:
import matplotlib.pyplot as plt

# Plot the fraud rate for each category
plt.figure(figsize=(8, 3))
plt.bar(category_data['category'], category_data['fraud_rate_by_category'], color='skyblue', edgecolor='black')
plt.title('Fraud Rate Distribution by Category (is_fraud = 1)')
plt.xlabel('Category')
plt.ylabel('Fraud Rate')
plt.xticks(rotation=90)  # Rotate x-axis labels by 90 degrees
plt.grid(True, axis='y')
plt.show()


In [None]:
unique_categories = df_train['category'].unique()
print(unique_categories)

## fraud rate by city


In [None]:
city_data = df_train.groupby('city').agg(
    total_transactions=('city', 'count'),
    fraud_cases=('is_fraud', 'sum')
).reset_index()


city_data['fraud_rate_by_city'] = city_data['fraud_cases'] / city_data['total_transactions']

city_data.sort_values(by=['total_transactions', 'fraud_rate_by_city'], ascending=[True, True])



In [None]:
# Using the raw fraud rate per city skews the data for small cities or cities with few transactions,
# so the city population is brought into the calculation

In [None]:
city_pop_data = df_train.groupby(['city']).agg(
    mean_pop=('city_pop','mean')
).reset_index()
city_pop_data

In [None]:
print(city_data.columns)
print(city_pop_data.columns)

In [None]:
# add the mean city population to city_data
city_data  = city_data.merge(city_pop_data[['city','mean_pop']], on='city', how='left')

# scale the population by the largest mean population
city_data['max_overal_pop']=city_data['mean_pop'].max()
city_data['scaled_pop']=city_data['mean_pop'] / city_data['max_overal_pop']
city_data['fraud_rate_by_city_scaled']=city_data['fraud_rate_by_city'] * city_data['scaled_pop']

city_data.sort_values(by=['total_transactions', 'fraud_rate_by_city_scaled'], ascending=[True, True])
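
To see what this population scaling does to the numbers, here is a toy sketch with made-up cities and values (none of them come from the dataset):

```python
import pandas as pd

# Made-up cities: a tiny city with a 100% raw fraud rate and two larger ones
city_toy = pd.DataFrame({
    'city': ['Smallville', 'Midtown', 'Metropolis'],
    'fraud_rate_by_city': [1.0, 0.02, 0.01],
    'mean_pop': [100, 50_000, 1_000_000],
})

# Same steps as above: divide by the largest population, then multiply the rate
city_toy['scaled_pop'] = city_toy['mean_pop'] / city_toy['mean_pop'].max()
city_toy['fraud_rate_by_city_scaled'] = city_toy['fraud_rate_by_city'] * city_toy['scaled_pop']
print(city_toy)
```

In this toy case the tiny city drops from the highest raw rate to the lowest scaled value, because the scaled feature is dominated by the population share.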

In [None]:
city_data.describe()

In [None]:
df_train = df_train.merge(city_data[['city', 'fraud_rate_by_city_scaled']], on='city', how='left')
df_train

In [None]:
import matplotlib.pyplot as plt

# Get the top `jmlnya` cities with the highest (population-scaled) fraud rate
jmlnya = 10
top_rate_cities = df_train.groupby('city')['fraud_rate_by_city_scaled'].mean().sort_values(ascending=False).head(jmlnya)

# Plot the fraud rate for the top jmlnya cities
plt.figure(figsize=(12, 6))
plt.bar(top_rate_cities.index, top_rate_cities.values, color='skyblue', edgecolor='black')
plt.title('Top 10 Cities with Highest Fraud Rate')
plt.xlabel('City')
plt.ylabel('Fraud Rate')
plt.xticks(rotation=90)  # Rotate x-axis labels by 90 degrees
plt.grid(True, axis='y')
plt.show()


In [None]:
# Jupyter process stopper: uncomment to break the running process of "Run All"


In [None]:
category_stats = df_train.groupby('category')['amt'].describe(percentiles=[.05, .95])
category_stats = category_stats.rename(columns={'5%': 'cat_amt_5pctl', '95%': 'cat_amt_95pctl'})
category_stats

## flagging outliers for Category vs Amount

In [None]:
# join to df_train
df_train= df_train.merge(category_stats[['cat_amt_5pctl','cat_amt_95pctl']], on='category', how = 'left')
df_train['category_lower_outlier']=np.where(df_train['amt'] < df_train['cat_amt_5pctl'],1,0)
df_train['category_upper_outlier']=np.where(df_train['amt'] > df_train['cat_amt_95pctl'],1,0)
# drop the helper columns
df_train=df_train.drop('cat_amt_5pctl', axis=1)
df_train=df_train.drop('cat_amt_95pctl', axis=1)

df_train
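
The percentile flagging above can also be written without the temporary helper columns. The following is only a sketch on made-up amounts, using `groupby(...).transform` to broadcast each group's percentile back to its rows:

```python
import pandas as pd

# Made-up amounts: each category has one unusually low and one unusually high value
toy = pd.DataFrame({
    'category': ['grocery'] * 5 + ['travel'] * 5,
    'amt': [10, 11, 12, 13, 500, 100, 110, 120, 130, 1],
})

grouped_amt = toy.groupby('category')['amt']
# transform() repeats the per-group percentile on every row of that group
lower = grouped_amt.transform(lambda s: s.quantile(0.05))
upper = grouped_amt.transform(lambda s: s.quantile(0.95))

toy['category_lower_outlier'] = (toy['amt'] < lower).astype(int)
toy['category_upper_outlier'] = (toy['amt'] > upper).astype(int)
print(toy)
```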

In [None]:
job_stats = df_train.groupby('job')['amt'].describe(percentiles=[.05, .95])
job_stats = job_stats.rename(columns={'5%': 'job_amt_5pctl', '95%': 'job_amt_95pctl'})
job_stats

## flagging outliers for Job vs Amount

In [None]:
# join to df_train
df_train= df_train.merge(job_stats[['job_amt_5pctl','job_amt_95pctl']], on='job', how = 'left')
df_train['job_lower_outlier']=np.where(df_train['amt'] < df_train['job_amt_5pctl'],1,0)
df_train['job_upper_outlier']=np.where(df_train['amt'] > df_train['job_amt_95pctl'],1,0)
# drop the helper columns
df_train=df_train.drop('job_amt_5pctl', axis=1)
df_train=df_train.drop('job_amt_95pctl', axis=1)
df_train

TO DO items still open for the EDA above:

- is high-risk flagging also needed for distance vs isFraud == 1?  >>>>> DONE below
- is high-risk flagging also needed for age vs isFraud == 1?  >>>>> DONE below
- is high-risk flagging also needed for trxtime_bookingtime_diff vs isFraud == 1?  >>>>> DONE below

- CC usage count (many small transactions, or a single one with a big amount right away): how should the data / chart be modelled? (maybe with a cumulative transaction count and an average amount per CC for the month in question?) >>>>> SKIPPED, no time

## fraud rate by age group

In [None]:
# same approach as in "fraud rate by unix_hour";
# more detailed comments can be found there

umur_data = df_train.groupby('person_age_grp').agg(
    total_transactions=('is_fraud', 'size'),
    fraud_cases=('is_fraud', 'sum')
).reset_index()

umur_data['fraud_rate_by_umur'] = umur_data['fraud_cases'] / umur_data['total_transactions']

df_train = df_train.merge(umur_data[['person_age_grp', 'fraud_rate_by_umur']], on='person_age_grp', how='left')

df_train.sample(10)

In [None]:
plt.figure(figsize=(12, 8))
plt.bar(umur_data['person_age_grp'], umur_data['fraud_rate_by_umur'], color='skyblue', edgecolor='black', alpha=0.7)

plt.title('Fraud Rate by Age Group', fontsize=16)
plt.xlabel('Age Group (person_age_grp)', fontsize=14)
plt.ylabel('Fraud Rate by Age', fontsize=14)

plt.xticks(rotation=45, ha='right')

plt.grid(True)

plt.show()

## fraud rate by distance group

In [None]:
# same approach as in "fraud rate by unix_hour";
# more detailed comments can be found there

distance_data = df_train.groupby('card_merchant_distance_km_grp').agg(
    total_transactions=('is_fraud', 'size'),
    fraud_cases=('is_fraud', 'sum')
).reset_index()

distance_data['fraud_rate_by_distance'] = distance_data['fraud_cases'] / distance_data['total_transactions']

df_train = df_train.merge(distance_data[['card_merchant_distance_km_grp', 'fraud_rate_by_distance']], on='card_merchant_distance_km_grp', how='left')

df_train.sample(10)

In [None]:
plt.figure(figsize=(12, 8))
plt.bar(distance_data['card_merchant_distance_km_grp'], distance_data['fraud_rate_by_distance'], color='skyblue', edgecolor='black', alpha=0.7)

plt.title('Fraud Rate by Distance Group', fontsize=16)
plt.xlabel('Card Merchant Distance Group (card_merchant_distance_km_grp)', fontsize=14)
plt.ylabel('Fraud Rate by Distance', fontsize=14)

plt.xticks(rotation=45, ha='right')

plt.grid(True)

plt.show()

## fraud rate by dow_diff (difference between transaction and booking)

In [None]:
# same approach as in "fraud rate by unix_hour";
# more detailed comments can be found there

day_diff_data = df_train.groupby('dow_dif').agg(
    total_transactions=('is_fraud', 'size'),
    fraud_cases=('is_fraud', 'sum')
).reset_index()

day_diff_data['fraud_rate_by_dow_diff'] = day_diff_data['fraud_cases'] / day_diff_data['total_transactions']

df_train = df_train.merge(day_diff_data[['dow_dif', 'fraud_rate_by_dow_diff']], on='dow_dif', how='left')

df_train.sample(10)

In [None]:
plt.figure(figsize=(12, 8))
plt.bar(day_diff_data['dow_dif'], day_diff_data['fraud_rate_by_dow_diff'], color='skyblue', edgecolor='black', alpha=0.7)

plt.title('Fraud Rate by DoW_Diff Group', fontsize=16)
plt.xlabel('DoW_Diff Group (dow_dif)', fontsize=14)
plt.ylabel('Fraud Rate by DoW_Diff', fontsize=14)

plt.xticks(rotation=45, ha='right')

plt.grid(True)

plt.show()

## dummy variable for Gender

In [None]:

df_train = df_train.copy(deep=True)

# Because the 'risk_level_of_category' flag is already used, 'category' does not need to become a dummy variable (label encoding or OHE)
    # # copy the data first for EDA purposes (e.g. checking missing values, etc.)
    # df_train['category_ori'] = df_train['category']
    # # create dummy variables (OHE)
    # df_train = pd.get_dummies(df_train, columns=['category'], drop_first=True)


# copy the data first for EDA purposes (e.g. checking missing values, etc.)
# df_train['gender_ori'] = df_train['gender']

# create the dummy variable (OHE)
df_train = pd.get_dummies(df_train, columns=['gender'], drop_first=True)

# df_train.sample(5)
# df_train.sample(30).to_csv('df_train_encoded_sample.csv', index=False)

In [None]:
df_train.columns

In [None]:
mini_eda(df_train, 'eda_report__df_train_withDummies.pdf')

In [None]:
# Jupyter process stopper: uncomment to break the running process of "Run All"


# SPLIT

In [None]:
# split into features and target

X = df_train.drop('is_fraud', axis=1)
y = df_train['is_fraud']


# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split training data into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# X_train and y_train are joined back together because imbalanced-data handling will be applied
df_train_8080 = pd.concat([X_train, y_train], axis=1)
df_val_8020 = pd.concat([X_val, y_val], axis=1)
df_test_20 = pd.concat([X_test, y_test], axis=1)

# Optionally, check the shape of the split data
print(f"Training features shape: {X_train.shape}")
print(f"Validation features shape: {X_val.shape}")
print(f"Test features shape: {X_test.shape}")
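
The two successive 80/20 splits above leave roughly 64% of the rows for training, 16% for validation and 20% for testing, which is what the `8080`, `8020` and `20` suffixes refer to. A quick arithmetic sketch follows; `split_sizes` and the row count are hypothetical, and the rounding is an assumption about how the held-out share is rounded:

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Approximate row counts after a test split followed by a validation split."""
    # Assumption: the held-out share is rounded up and the remainder goes to training
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_size)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)  # hypothetical row count
print(f"train={n_train}, val={n_val}, test={n_test}")
```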


## Handling Imbalanced Data

In [None]:

target_distribution = df_train_8080['is_fraud'].value_counts()
print(target_distribution)
print('-------------------')
print(target_distribution / target_distribution.sum() * 100)
print('-------------------')

fraud_cases = target_distribution[1]  
non_fraud_cases = target_distribution[0]  
imbalance_ratio = non_fraud_cases / fraud_cases
print(f"Imbalance ratio (Non-fraud / Fraud): {imbalance_ratio:.2f}")
print('-------------------')


In [None]:
# install it first if it is not yet available in the environment
#!pip install imbalanced-learn

In [None]:
df_train_8080.columns

## Correlation Map

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix (numeric columns only; text and datetime columns are skipped)
corr_matrix = df_train_8080.corr(numeric_only=True)

# Plot the heatmap
plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# manually select the features

list_of_selected_field=[
# 'Unnamed:0',
# 'trans_date_trans_time',
# 'cc_num',
# 'merchant',
# 'category',
'amt',
# 'first',
# 'last',
# 'street',
# 'city',
# 'state',
# 'zip',
# 'lat',
# 'long',
'city_pop',
# 'job',
# 'dob',
# 'trans_num',
# 'unix_time',
# 'merch_lat',
# 'merch_long',
'is_fraud',
# 'trans_year',
# 'trans_month',
# 'trans_date',
# 'trans_hour',
# 'trans_dow',
'unix_year',
'unix_month',
'unix_date',
'unix_hour',
'unix_dow',
'dow_dif',
# 'card_merchant_distance_km',
'card_merchant_distance_km_grp',
# 'person_age',
'person_age_grp',
'fraud_rate_by_unix_hour',
'fraud_rate_by_category',
'fraud_rate_by_city_scaled',
'category_lower_outlier',
'category_upper_outlier',
'job_lower_outlier',
'job_upper_outlier',
'fraud_rate_by_umur',
'fraud_rate_by_distance',
'fraud_rate_by_dow_diff',
'gender_M'
]

## Splitted Data and Selected Features

In [None]:
df_train_8080_selected = df_train_8080[list_of_selected_field]
df_val_8020_selected = df_val_8020[list_of_selected_field]
df_test_20_selected= df_test_20[list_of_selected_field]

### correlation map for the training data with the selected features

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
corr_matrix = df_train_8080_selected.corr()

# Plot the heatmap
plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
X = df_train_8080_selected.drop('is_fraud', axis=1)
y = df_train_8080_selected['is_fraud']

smote = SMOTE(random_state=42)
X_smoted, y_smoted = smote.fit_resample(X, y)

df_train_balanced_smote = pd.DataFrame(X_smoted, columns=X.columns)
df_train_balanced_smote['is_fraud'] = y_smoted


In [None]:
target_distribution = df_train_balanced_smote['is_fraud'].value_counts()
print(target_distribution)
print('-------------------')
print(target_distribution / target_distribution.sum() * 100)
print('-------------------')

fraud_cases = target_distribution[1]  
non_fraud_cases = target_distribution[0]  
imbalance_ratio = non_fraud_cases / fraud_cases
print(f"Imbalance ratio (Non-fraud / Fraud): {imbalance_ratio:.2f}")
print('-------------------')

====================================================================<br>
at this point the data has been split and the features to use have been selected (manually)<br>
all that remains is to choose the model<br>
====================================================================<br>
data ready to use:<br>
- df_train_balanced_smote : training data
- df_val_8020_selected  : validation data
- df_test_20_selected : test data


# ADABOOST MODELING

## Define Adaboost Model

In [None]:
#
base_estimator = RandomForestClassifier() # no parameters are set; all defaults are used

# Initialize AdaBoost with the base estimator and n_estimators (number of boosting rounds)
model = AdaBoostClassifier(
        estimator=base_estimator, 
        n_estimators=100, 
        random_state=42)


## Model Fitting

### Fitting the Model [WARNING: SLOW!]

 skip this if it has already been run once and load the saved model below instead

In [None]:
X = df_train_balanced_smote.drop('is_fraud', axis=1)
y = df_train_balanced_smote['is_fraud']

In [None]:
model.fit(X, y)

### Save/Load Model to/from File (can be skipped if not needed)

In [None]:
# save the model first
nama_filenya = 'model_adaboost_RandomForestClassifier_n100.pkl'
timestamp = datetime.now().strftime('%Y%m%d_%H%M')

nama_filenya = nama_filenya.replace('.pkl', f'_{timestamp}.pkl')
save_model_to_pickle(model,nama_filenya)

In [None]:
# # load the model
# nama_filenya = 'model_adaboost_RandomForestClassifier_n100_20250227_0613.pkl' #<<< change the file name to match the model you want to load
model = load_model_from_pickle(nama_filenya)

## Checking Feature Importance

In [None]:
feature_importances = model.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': X.columns, 
    'Importance': feature_importances
})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)


In [None]:
feature_importance_df

Based on the feature importances above, a second model will be built later using only the features whose importance is 1% or higher.

## Predict Validation Data

In [None]:
X_val = df_val_8020_selected.drop('is_fraud', axis=1)
y_val = df_val_8020_selected['is_fraud']

# Predict on the validation set
y_pred = model.predict(X_val)

# Predict probabilities for ROC-AUC evaluation
y_pred_prob = model.predict_proba(X_val)[:, 1]

## Evaluate Predicted Validation Data

In [None]:
# Print classification report (precision, recall, f1-score)
print("Classification Report:\n", classification_report(y_val, y_pred))
print('---------------------------------------------------')
# Confusion matrix to understand true positives, false positives, etc.
cm = confusion_matrix(y_val, y_pred)
cm_df = pd.DataFrame(cm, index=['Actual Non-Fraud', 'Actual Fraud'], columns=['Predicted Non-Fraud', 'Predicted Fraud'])
print("Confusion Matrix:\n", cm_df )
print('---------------------------------------------------')
# Calculate ROC-AUC score for model's ability to distinguish between classes
roc_auc = roc_auc_score(y_val, y_pred_prob)
print("ROC-AUC Score:", roc_auc)
print('---------------------------------------------------')
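
To make the link between the confusion matrix and the classification report concrete, here is a small worked sketch with made-up counts (they are not results from this notebook). It also illustrates why accuracy alone can look good on heavily imbalanced data:

```python
# Made-up confusion-matrix counts, for illustration only (not results from this notebook)
tn, fp = 9900, 50   # actual non-fraud: predicted non-fraud / predicted fraud
fn, tp = 10, 40     # actual fraud:     predicted non-fraud / predicted fraud

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

In this toy case accuracy is above 99% even though fewer than half of the predicted frauds are real, which is why the fraud-class precision, recall and ROC-AUC are worth reading alongside it.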


## Prepare Test Data

In [None]:
X_test = df_test_20_selected.drop('is_fraud', axis=1)
y_test = df_test_20_selected['is_fraud']


## Predicted Test Data

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

# Predict probabilities for ROC-AUC evaluation 
y_pred_prob = model.predict_proba(X_test)[:, 1]

## Evaluate Test Data Prediction

In [None]:
# Print classification report (precision, recall, f1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))
print('---------------------------------------------------')
# Confusion matrix to understand true positives, false positives, etc.
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['Actual Non-Fraud', 'Actual Fraud'], columns=['Predicted Non-Fraud', 'Predicted Fraud'])
print("Confusion Matrix:\n", cm_df )
print('---------------------------------------------------')
# Calculate ROC-AUC score for model's ability to distinguish between classes
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC-AUC Score:", roc_auc)
print('---------------------------------------------------')

Save the classification report for the comparison at the very end.

In [None]:
summaryAkhir = 'Classification Report for Test Data using the model with manually selected features\n'
summaryAkhir = summaryAkhir + classification_report(y_test, y_pred)
summaryAkhir = summaryAkhir +'----------------------------------------------------------\n'

# ADABOOST MODELING (AGAIN)
(this time using only the features that contribute at least 1% according to feature importance)

## Feature Selection based on Feature Importance

In [None]:
# Select only the features whose contribution is >= 1%
list_of_selected_field2 = feature_importance_df[feature_importance_df['Importance']>=0.01]
list_of_selected_field2
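Before dropping features, it can help to check how much of the total importance the 1% cut-off keeps. The sketch below is one way to do that, assuming a DataFrame with the same `Feature` and `Importance` columns as `feature_importance_df`. The values and the name `toy_importance_df` are hypothetical.

```python
import pandas as pd

# Hypothetical importances, for illustration only (not the notebook's real values).
toy_importance_df = pd.DataFrame({
    'Feature': ['amt', 'category', 'lat', 'gender', 'zip'],
    'Importance': [0.60, 0.25, 0.10, 0.045, 0.005],
})

threshold = 0.01  # same 1% cut-off as in the cell above
kept = toy_importance_df[toy_importance_df['Importance'] >= threshold]
dropped = toy_importance_df[toy_importance_df['Importance'] < threshold]

# Share of the summed importance that survives the cut-off
retained_share = kept['Importance'].sum() / toy_importance_df['Importance'].sum()

print(f"Kept {len(kept)} of {len(toy_importance_df)} features")
print(f"Share of total importance retained: {retained_share:.3f}")
print("Dropped features:", list(dropped['Feature']))
```

If the retained share is close to 1, the second model loses little signal according to the first model's importances, though only the evaluation on validation and test data can confirm that.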

In [None]:
# List of features selected based on feature importance
importantFeatures = list(list_of_selected_field2['Feature'].unique()) 
importantFeatures_and_target = importantFeatures + ['is_fraud']

df_val_8020_penting = df_val_8020_selected[importantFeatures_and_target]
df_test_20_penting = df_test_20_selected[importantFeatures_and_target]
df_train_balanced_smote_penting = df_train_balanced_smote[importantFeatures_and_target]

====================================================================<br>
At this point the data has been split and a subset of features has been selected based on feature importance.<br>
All that remains is to choose the model to use.<br>
====================================================================<br>
Data ready to use:<br>
- df_train_balanced_smote_penting: training data
- df_val_8020_penting: validation data
- df_test_20_penting: test data
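Because the model is fitted on one DataFrame and evaluated on two others, a quick check that all three share the same columns in the same order can catch selection mistakes early. The sketch below uses toy stand-ins; `check_same_columns` and the `*_toy` frames are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the three prepared DataFrames (hypothetical values).
cols = ['amt', 'category', 'is_fraud']
train_toy = pd.DataFrame([[10.0, 1, 0], [950.0, 3, 1]], columns=cols)
val_toy = pd.DataFrame([[25.5, 2, 0]], columns=cols)
test_toy = pd.DataFrame([[480.0, 3, 1]], columns=cols)

def check_same_columns(reference, others):
    """Raise if any frame differs from the reference in column names or order."""
    for name, frame in others.items():
        if list(frame.columns) != list(reference.columns):
            raise ValueError(f'{name} columns differ from the training data')
    return True

columns_ok = check_same_columns(train_toy, {'validation': val_toy, 'test': test_toy})
print(columns_ok)
```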

## Model Fitting

### Warning: this step is slow (about 15 minutes on a PC)

In [None]:
X = df_train_balanced_smote_penting[importantFeatures]
y = df_train_balanced_smote_penting['is_fraud']

model.fit(X, y)

## Save / Load Model

In [None]:
# Save the model first
nama_filenya = 'model_adaboost_RandomForestClassifier_penting.pkl'
timestamp = datetime.now().strftime('%Y%m%d_%H%M')

nama_filenya = nama_filenya.replace('.pkl', f'_{timestamp}.pkl')
save_model_to_pickle(model,nama_filenya)

In [None]:
# nama_filenya = 'model_adaboost_RandomForestClassifier_featured.pkl'  # replace with another file name if desired
model = load_model_from_pickle(nama_filenya)
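The helpers `save_model_to_pickle` and `load_model_from_pickle` are defined elsewhere in the notebook and are not shown in this section. As a rough sketch of what such helpers usually do, assuming they wrap the standard `pickle` module, the round trip looks like the code below. The `_sketch` names and the stand-in object are hypothetical, so the notebook's real helpers are not overwritten.

```python
import os
import pickle
import tempfile

def save_model_to_pickle_sketch(model, filename):
    """Serialize any picklable object to `filename` (assumed behaviour)."""
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

def load_model_from_pickle_sketch(filename):
    """Load an object back from `filename`. Only unpickle files you trust."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Round trip with a stand-in object instead of a fitted model.
stand_in_model = {'name': 'stand-in', 'n_estimators': 50}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_example.pkl')
    save_model_to_pickle_sketch(stand_in_model, path)
    restored = load_model_from_pickle_sketch(path)

print(restored == stand_in_model)
```

Note that a pickled scikit-learn model is generally only safe to load with the same library version that saved it.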

## Predict Validation Data

In [None]:
X_val = df_val_8020_penting.drop('is_fraud', axis=1)
y_val = df_val_8020_penting['is_fraud']

# Predict on the validation set
y_pred = model.predict(X_val)

# Predict probabilities for ROC-AUC evaluation
y_pred_prob = model.predict_proba(X_val)[:, 1]

## Evaluate Predicted Validation Data

In [None]:
print(f'y_val values :\n{y_val.value_counts()}')
y_pred_series = pd.Series(y_pred)
print(f'y_pred values :\n{y_pred_series.value_counts()}')

In [None]:
# Print classification report (precision, recall, f1-score)
print("Classification Report:\n", classification_report(y_val, y_pred))
print('---------------------------------------------------')
# Confusion matrix to understand true positives, false positives, etc.
cm = confusion_matrix(y_val, y_pred)
cm_df = pd.DataFrame(cm, index=['Actual Non-Fraud', 'Actual Fraud'], columns=['Predicted Non-Fraud', 'Predicted Fraud'])
print("Confusion Matrix:\n", cm_df )
print('---------------------------------------------------')
# Calculate ROC-AUC score for model's ability to distinguish between classes
roc_auc = roc_auc_score(y_val, y_pred_prob)
print("ROC-AUC Score:", roc_auc)
print('---------------------------------------------------')

## Predict Test Data

In [None]:
X_test = df_test_20_penting.drop('is_fraud', axis=1)
y_test = df_test_20_penting['is_fraud']

# Predict on the test set
y_pred = model.predict(X_test)

# Predict probabilities for ROC-AUC evaluation 
y_pred_prob = model.predict_proba(X_test)[:, 1]

## Evaluate Test Data Prediction

In [None]:
# Print classification report (precision, recall, f1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))
print('---------------------------------------------------')
# Confusion matrix to understand true positives, false positives, etc.
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=['Actual Non-Fraud', 'Actual Fraud'], columns=['Predicted Non-Fraud', 'Predicted Fraud'])
print("Confusion Matrix:\n", cm_df )
print('---------------------------------------------------')
# Calculate ROC-AUC score for model's ability to distinguish between classes
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC-AUC Score:", roc_auc)
print('---------------------------------------------------')

# COMPARE RESULT

In [None]:
summaryAkhir = summaryAkhir + 'Classification Report for Test Data using the model with important features only\n'
summaryAkhir = summaryAkhir + classification_report(y_test, y_pred)
summaryAkhir = summaryAkhir +'----------------------------------------------------------\n'

print(summaryAkhir)
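Two text reports printed one after the other can be hard to compare by eye. One possible alternative, sketched below, is to ask `classification_report` for a dictionary with `output_dict=True` and place the fraud-class rows side by side in a DataFrame. The labels and predictions are hypothetical toy data, and `fraud_row` is a helper name introduced only for this illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical labels and predictions, for illustration only.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_manual_toy = [0, 0, 1, 0, 1, 0, 1, 0]     # stand-in for the first model
y_pred_important_toy = [0, 0, 0, 0, 1, 0, 1, 0]  # stand-in for the second model

def fraud_row(y_true, y_pred):
    """Return the fraud-class (label 1) metrics plus overall accuracy as a dict."""
    report = classification_report(y_true, y_pred, output_dict=True)
    row = dict(report['1'])
    row['accuracy'] = report['accuracy']
    return row

# One row per model, one column per metric
comparison_df = pd.DataFrame({
    'manual features': fraud_row(y_true_toy, y_pred_manual_toy),
    'important features': fraud_row(y_true_toy, y_pred_important_toy),
}).T
print(comparison_df)
```

In the notebook itself, the two calls would use the `y_test` and `y_pred` pairs from each model's test evaluation, which means the first model's predictions need to be kept under a separate variable name before `y_pred` is overwritten.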

# CURRENT BOTTOM OF THE NOTEBOOK