# <center>Credit Card Fraud Detection</center>

## Mini-Project Objective
Given the Credit Card Fraud Detection dataset, determine whether each transaction is fraudulent or not, using only NumPy.

---

## Student Information

**Student:**
- Full Name: Cao Trần Bá Đạt
- Student ID: 23127168

**Class:** 23KHDL

## <center>PREPROCESSING</center>

### 1. Library

In [1]:
import numpy as np

### 2. Read file

In [2]:
path = '../data/raw/creditcard.csv'
df = np.genfromtxt(path, delimiter=',', skip_header = 1, dtype=str)

print("="*60)
print("DATASET LOADED")
print("="*60)
print(f"Shape: {df.shape}")
print(f"  Rows: {df.shape[0]} transactions")
print(f"  Columns: {df.shape[1]}") 

columns = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
           'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
           'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']

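print("Column names:")
for col in columns:
    print(f" - {col}")

============================================================
DATASET LOADED
============================================================
Shape: (284807, 31)
  Rows: 284807 transactions
  Columns: 31
Column names: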
 - Time
 - V1
 - V2
 - V3
 - V4
 - V5
 - V6
 - V7
 - V8
 - V9
 - V10
 - V11
 - V12
 - V13
 - V14
 - V15
 - V16
 - V17
 - V18
 - V19
 - V20
 - V21
 - V22
 - V23
 - V24
 - V25
 - V26
 - V27
 - V28
 - Amount
 - Class
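To see what `dtype=str` loading produces, here is a minimal sketch on a tiny in-memory CSV. The two data rows and the three-column layout are made up for illustration; the point is that `np.genfromtxt` does not interpret quotes, so a quoted label is kept verbatim as text.

```python
import io
import numpy as np

# Hypothetical miniature of the CSV: a quoted header and a quoted label column.
csv_text = '"Time","Amount","Class"\n0,149.62,"0"\n1,2.69,"1"\n'

# Same call pattern as the cell above, but reading from a file-like object.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1, dtype=str)

print(sample.shape)   # two rows, three columns
print(sample[:, -1])  # labels still carry their double quotes
```

Because every value is still a string at this point, the array cannot be used numerically until it is cleaned and cast.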


### 3. Preprocessing

#### 3.1 Handle the data type of the "Class" column

In [3]:
df[:,-1] = np.char.strip(df[:,-1], '"')  # Remove the double quotes wrapping the 'Class' values

df = df.astype(np.float32)
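The cleaning step can be checked on a small hand-built array. This is a sketch that assumes the labels arrive wrapped in double quotes, as the `np.char.strip` call suggests; the values in `raw` are hypothetical.

```python
import numpy as np

# Hypothetical string array: the last column holds quoted labels.
raw = np.array([["0.0", "1.5", '"0"'],
                ["1.0", "2.5", '"1"'],
                ["2.0", "3.5", '"0"']])

cleaned = raw.copy()
# Strip leading and trailing double quotes from the label column only.
cleaned[:, -1] = np.char.strip(cleaned[:, -1], '"')

# Once every cell is a plain number in text form, the cast succeeds.
numeric = cleaned.astype(np.float32)
print(numeric[:, -1])
```

Without the strip, `astype(np.float32)` would raise a `ValueError` on the quoted labels.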

#### 3.2 Separate features X and labels y

In [4]:
X = df[:,:-1]   
y = df[:,-1]

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (284807, 30)
y shape: (284807,)


#### 3.3 Handling missing values (NaN) with column medians

In [5]:
col_medians = np.nanmedian(X, axis=0) 
X = np.where(np.isnan(X), col_medians, X)
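A small worked example of the median imputation above, on a made-up 4x2 matrix. `np.nanmedian` ignores the NaN entries, and `np.where` broadcasts the per-column medians across rows.

```python
import numpy as np

# Hypothetical feature matrix with one missing value per column.
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [5.0, 40.0]])

medians = np.nanmedian(X_demo, axis=0)                  # per-column median, NaNs skipped
X_filled = np.where(np.isnan(X_demo), medians, X_demo)  # fill only the NaN positions

print(medians)   # [ 3. 20.]
print(X_filled)
```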

#### 3.4 Handling outliers in the Amount column using percentile clipping (1%–99%)

In [6]:
amount_idx = X.shape[1] - 1
amount = X[:, amount_idx]

lower = np.percentile(amount, 1)
upper = np.percentile(amount, 99)

X[:, amount_idx] = np.clip(amount, lower, upper)
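To illustrate what the clipping does, the sketch below applies the same two calls to a made-up vector of the integers 0 to 100, where the 1st and 99th percentiles are easy to read off.

```python
import numpy as np

# Hypothetical amounts: 0, 1, ..., 100.
amount_demo = np.arange(101, dtype=np.float64)

lower_demo = np.percentile(amount_demo, 1)    # 1.0
upper_demo = np.percentile(amount_demo, 99)   # 99.0

# Values outside [lower, upper] are pulled to the nearest bound; the rest are unchanged.
clipped = np.clip(amount_demo, lower_demo, upper_demo)

print(lower_demo, upper_demo)          # 1.0 99.0
print(clipped.min(), clipped.max())    # 1.0 99.0
```

Clipping keeps every row, unlike dropping outliers, so the number of transactions stays the same.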

#### 3.5 Standardizing features using Z-Score normalization

In [7]:
mean = X.mean(axis=0)               # shape (d,)
std = X.std(axis=0) + 1e-8         # avoid division by zero
X_scaled = (X - mean) / std        # broadcasting -> shape (n, d)
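The standardization computes $z = (x - \mu) / \sigma$ per column. A minimal sketch on made-up data shows the result and why the `1e-8` term matters: the second column is constant, so its standard deviation is zero.

```python
import numpy as np

# Hypothetical data: column 0 varies, column 1 is constant.
X_demo = np.array([[1.0, 5.0],
                   [2.0, 5.0],
                   [3.0, 5.0]])

mean_demo = X_demo.mean(axis=0)
std_demo = X_demo.std(axis=0) + 1e-8   # epsilon keeps the constant column from dividing by zero
Z = (X_demo - mean_demo) / std_demo    # broadcasting over rows

print(Z.mean(axis=0))   # approximately 0 in both columns
print(Z[:, 0].std())    # approximately 1
print(Z[:, 1])          # all zeros instead of NaN
```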

#### 3.6 Split data into training and test sets

In [8]:
test_size = 0.2
rng = np.random.default_rng(42)
idx = np.arange(X_scaled.shape[0])
rng.shuffle(idx)

split = int(len(idx) * (1 - test_size))
train_idx = idx[:split]
test_idx  = idx[split:]

X_train = X_scaled[train_idx]
X_test  = X_scaled[test_idx]
y_train = y[train_idx]
y_test  = y[test_idx]
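As a sanity check, the index logic can be run on its own. The sketch below uses only the row count reported when the dataset was loaded (no actual data) and confirms that the two index sets are disjoint and have the expected sizes.

```python
import numpy as np

n_rows = 284807      # row count reported when the dataset was loaded
test_size = 0.2

rng = np.random.default_rng(42)
idx = np.arange(n_rows)
rng.shuffle(idx)     # in-place permutation of the row indices

split = int(len(idx) * (1 - test_size))
train_idx, test_idx = idx[:split], idx[split:]

# No row index should appear in both parts.
overlap = np.intersect1d(train_idx, test_idx)
print(len(train_idx), len(test_idx), overlap.size)   # 227845 56962 0
```

Note that this is a plain random shuffle, not a stratified split, so the share of fraudulent rows can differ slightly between the two parts.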

#### 3.7 Handling class imbalance using class weights

In [9]:
# pos_weight = (#neg / #pos)
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)

# avoid division by zero 
if n_pos == 0:
    pos_weight = 1.0
else:
    pos_weight = (n_neg / n_pos)

# sample weights vectorized
sample_weights = np.where(y_train == 1, pos_weight, 1.0)   # shape (n_train,)
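The effect of these weights can be seen on a toy label vector. The modeling step is not shown here, so `weighted_log_loss` below is a hypothetical helper illustrating one way such weights might be consumed, not the project's actual loss.

```python
import numpy as np

# Hypothetical labels: 8 legitimate, 2 fraudulent.
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=np.float32)

n_pos = np.sum(y_demo == 1)
n_neg = np.sum(y_demo == 0)
pos_weight_demo = n_neg / n_pos if n_pos > 0 else 1.0   # 4.0
w = np.where(y_demo == 1, pos_weight_demo, 1.0)

def weighted_log_loss(y_true, p, weights, eps=1e-7):
    """Hypothetical helper: weighted average of the binary cross-entropy."""
    p = np.clip(p, eps, 1 - eps)   # keep log() finite
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.sum(weights * per_sample) / np.sum(weights)

# With pos_weight = #neg / #pos, both classes carry the same total weight.
print(w[y_demo == 1].sum(), w[y_demo == 0].sum())   # 8.0 8.0
print(weighted_log_loss(y_demo, np.full(10, 0.5), w))
```

With equal total weight per class, a model can no longer reach a low loss simply by predicting "not fraud" for every transaction.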

#### 3.8 Save the preprocessed data to a new file for the modeling step

In [10]:
OUT_PATH = '../data/processed/creditcard_preprocessed.npz'

np.savez(
    OUT_PATH,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    mean=mean,
    std=std,
    col_medians=col_medians,
    pos_weight=pos_weight
)

print("Preprocessing finished. Saved to:", OUT_PATH)
print("X_train shape:", X_train.shape, "X_test shape:", X_test.shape)
print("pos_weight:", pos_weight)

Preprocessing finished. Saved to: ../data/processed/creditcard_preprocessed.npz
X_train shape: (227845, 30) X_test shape: (56962, 30)
pos_weight: 589.2720207253886
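For the modeling step, the archive can be read back with `np.load`. The sketch below writes a tiny stand-in file to a temporary directory (the arrays and the file name `demo_preprocessed.npz` are made up) and reloads it. One detail worth knowing: a Python scalar saved with `np.savez` comes back as a 0-d array, so it is converted with `float()`.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_preprocessed.npz")   # hypothetical file name

    # Stand-in for the real archive, using the same key names for a few entries.
    np.savez(
        demo_path,
        X_train=np.zeros((4, 3), dtype=np.float32),
        y_train=np.array([0, 0, 0, 1], dtype=np.float32),
        mean=np.zeros(3),
        std=np.ones(3),
        pos_weight=3.0,
    )

    # NpzFile is a context manager; arrays are read when indexed by key.
    with np.load(demo_path) as data:
        keys = sorted(data.files)
        X_train_loaded = data["X_train"]
        y_train_loaded = data["y_train"]
        pos_weight_loaded = float(data["pos_weight"])   # 0-d array -> Python float

print(keys)
print(X_train_loaded.shape, y_train_loaded.shape, pos_weight_loaded)
```

Storing `mean`, `std`, and `col_medians` alongside the splits lets later code apply the same transformations to new transactions.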
