## Task 1: Data Analysis and Preprocessing

The objective of this task is to prepare a clean, feature-rich dataset that is ready
for fraud detection modeling.

This includes:
- Exploring and understanding the data (EDA)
- Integrating geolocation information using IP addresses
- Engineering meaningful behavioral and time-based features
- Transforming data into a machine-learning-ready format
- Handling severe class imbalance using appropriate techniques

Because fraud datasets are highly imbalanced, we check the class distribution before
modeling and apply resampling **only to the training data**, so the test set keeps its
original class ratio.


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    average_precision_score,
    precision_recall_curve
)

from imblearn.over_sampling import SMOTE


In [2]:
fraud_df = pd.read_csv("../data/raw/Fraud_Data.csv")
ip_df = pd.read_csv("../data/raw/IpAddress_to_Country.csv")

print(fraud_df.shape)
print(ip_df.shape)

fraud_df.head()


(151112, 11)
(138846, 3)


,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0


In [3]:
ip_df.head()

,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China


### Why convert IP addresses to integers?

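In this dataset, IP addresses are stored as floating-point numbers (for example,
`732758400.0`) rather than as dotted-quad strings. To check whether a transaction IP
falls inside an IP range, the IP and both range bounds should share the same integer type.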

We convert:
- `ip_address` in fraud_df
- `lower_bound_ip_address` and `upper_bound_ip_address` in ip_df
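
For intuition, the integer form of an IPv4 address is its four octets read as one 32-bit number. The sketch below uses Python's standard `ipaddress` module to show that the first range in `ip_df` (16777216 to 16777471) corresponds to 1.0.0.0 through 1.0.0.255. The helper names `ip_to_int` and `int_to_ip` are illustrative assumptions and are not part of the notebook.

```python
import ipaddress

def ip_to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))

def int_to_ip(value: int) -> str:
    """Inverse conversion, handy for eyeballing range boundaries."""
    return str(ipaddress.IPv4Address(value))

print(ip_to_int("1.0.0.0"))   # 16777216
print(int_to_ip(16777471))    # 1.0.0.255
```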


In [None]:
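# Cast the float IPs to 64-bit integers (some values exceed the int32 range).
# Note: astype('int64') raises an error if the column contains NaN.
fraud_df['ip_int'] = fraud_df['ip_address'].astype('int64')

# Bring both range bounds to the same integer dtype.
ip_df['lower_bound_ip_address'] = ip_df['lower_bound_ip_address'].astype('int64')
ip_df['upper_bound_ip_address'] = ip_df['upper_bound_ip_address'].astype('int64')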



### IP-to-Country Mapping Logic

For each transaction IP:
- Find the IP range where:
  lower_bound ≤ ip ≤ upper_bound
- Assign the corresponding country
- If no match is found, label it as "Unknown"
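
The lookup rule above can also be expressed as a sorted join. The following is a minimal sketch using `pandas.merge_asof` on toy data, assuming the IP ranges do not overlap; the function name `map_country` and the toy values are hypothetical and only stand in for `fraud_df` and `ip_df`.

```python
import pandas as pd

# Toy stand-ins for fraud_df and ip_df (illustrative values only).
tx = pd.DataFrame({"ip_int": [16777300, 16777216, 5, 16779000, 16777800]})
ranges = pd.DataFrame({
    "lower_bound_ip_address": [16777216, 16777472, 16778240],
    "upper_bound_ip_address": [16777471, 16777727, 16779263],
    "country": ["Australia", "China", "Australia"],
})

def map_country(tx: pd.DataFrame, ranges: pd.DataFrame) -> pd.Series:
    """Assign a country to each ip_int, or 'Unknown' if no range contains it."""
    # merge_asof requires both frames to be sorted by their join keys.
    left = tx.reset_index().sort_values("ip_int")
    right = ranges.sort_values("lower_bound_ip_address")
    merged = pd.merge_asof(
        left, right,
        left_on="ip_int", right_on="lower_bound_ip_address",
        direction="backward",  # nearest lower bound that is <= ip_int
    )
    # The backward match only guarantees lower_bound <= ip; check the upper bound too.
    in_range = merged["ip_int"] <= merged["upper_bound_ip_address"]
    merged["country"] = merged["country"].where(in_range, "Unknown")
    # Restore the original row order.
    return merged.set_index("index").sort_index()["country"]

countries = map_country(tx, ranges)
print(countries.tolist())
```

Compared with scanning every range for every transaction, a sorted join of this kind does the matching in a single pass, which tends to matter at the scale of roughly 150,000 transactions and 139,000 ranges.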




In [5]:
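def get_country(ip_int):
    """Return the country whose IP range contains ip_int, or 'Unknown'."""
    # Boolean mask over all ranges: lower_bound <= ip_int <= upper_bound.
    row = ip_df[
        (ip_df['lower_bound_ip_address'] <= ip_int) &
        (ip_df['upper_bound_ip_address'] >= ip_int)
    ]
    return row.iloc[0]['country'] if not row.empty else 'Unknown'

# Note: this scans every range for every transaction, so it is slow on the full data.
fraud_df['country'] = fraud_df['ip_int'].apply(get_country)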


### Analyzing fraud behavior geographically

For each country, we compute:
- Number of transactions per country
- Fraud rate per country (mean of `class`)


In [6]:
transaction_counts = fraud_df['country'].value_counts()

fraud_rate = (
    fraud_df
    .groupby('country')['class']
    .mean()
    .sort_values(ascending=False)
)

country_summary = pd.DataFrame({
    'transactions': transaction_counts,
    'fraud_rate': fraud_rate
}).sort_values(by='fraud_rate', ascending=False)

country_summary.head(20)


country,transactions,fraud_rate
Turkmenistan,1,1.0
Namibia,23,0.434783
Sri Lanka,31,0.419355
Luxembourg,72,0.388889
Virgin Islands (U.S.),3,0.333333
Ecuador,106,0.264151
Tunisia,118,0.262712
Peru,119,0.260504
Bolivia,53,0.245283
Kuwait,90,0.233333
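
Several of the top fraud rates above rest on very few transactions (Turkmenistan has a single one), so the ranking is noisy. A minimal sketch of one way to guard against this is to require a minimum number of transactions before ranking. The function name `country_fraud_summary`, the `min_transactions` parameter, and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy data: country A has 4 transactions, B has 1, C has 2.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "C", "C"],
    "class":   [0, 1, 0, 0, 1, 0, 0],
})

def country_fraud_summary(df: pd.DataFrame, min_transactions: int = 1) -> pd.DataFrame:
    """Transactions and fraud rate per country, dropping low-volume countries."""
    summary = df.groupby("country")["class"].agg(
        transactions="size",  # number of rows per country
        fraud_rate="mean",    # share of rows with class == 1
    )
    summary = summary[summary["transactions"] >= min_transactions]
    return summary.sort_values("fraud_rate", ascending=False)

summary = country_fraud_summary(toy, min_transactions=2)
print(summary)
```

With the threshold set to 2, country B (one transaction, 100% fraud) no longer tops the list. A suitable threshold for the real data is a judgment call.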


### Time-based features

In [7]:
fraud_df['signup_time'] = pd.to_datetime(fraud_df['signup_time'])
fraud_df['purchase_time'] = pd.to_datetime(fraud_df['purchase_time'])

fraud_df['hour_of_day'] = fraud_df['purchase_time'].dt.hour
fraud_df['day_of_week'] = fraud_df['purchase_time'].dt.dayofweek

fraud_df['time_since_signup'] = (
    fraud_df['purchase_time'] - fraud_df['signup_time']
).dt.total_seconds()
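
As a worked example, the same three features can be computed by hand for two rows shown in `fraud_df.head()`: the fraudulent row for `user_id` 1359 and the legitimate row for `user_id` 22058. This sketch uses only the standard library; the helper name `time_features` is hypothetical.

```python
from datetime import datetime

def time_features(signup: str, purchase: str) -> dict:
    """Compute hour_of_day, day_of_week and time_since_signup for one row."""
    s = datetime.fromisoformat(signup)
    p = datetime.fromisoformat(purchase)
    return {
        "hour_of_day": p.hour,
        "day_of_week": p.weekday(),  # Monday=0, same convention as pandas dt.dayofweek
        "time_since_signup": (p - s).total_seconds(),
    }

fraud_row = time_features("2015-01-01 18:52:44", "2015-01-01 18:52:45")
legit_row = time_features("2015-02-24 22:55:49", "2015-04-18 02:47:11")
print(fraud_row)
print(legit_row)
```

The fraudulent row was purchased one second after signup, while the legitimate row waited several weeks. This is a single pair of rows, not evidence of a general pattern, but it shows the kind of contrast `time_since_signup` is meant to capture.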


### Transaction frequency / velocity

In [8]:
transaction_counts_user = (
    fraud_df
    .groupby('user_id')
    .size()
    .rename('transactions_per_user')
)

fraud_df = fraud_df.merge(
    transaction_counts_user,
    on='user_id',
    how='left'
)
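
The per-user count can also be computed without a merge. The sketch below, on hypothetical toy data, uses `groupby(...).transform("size")` to broadcast each group's row count back to its rows, and then checks what share of rows belong to users with more than one transaction. If that share is zero on the real data, the feature is constant and carries no signal.

```python
import pandas as pd

# Toy data: user 10 appears three times, users 11 and 12 once each.
toy = pd.DataFrame({
    "user_id": [10, 11, 10, 12, 10],
    "purchase_value": [34, 16, 15, 44, 39],
})

# transform("size") returns one value per row: the size of that row's group.
toy["transactions_per_user"] = (
    toy.groupby("user_id")["purchase_value"].transform("size")
)

# Share of rows that belong to repeat users.
repeat_share = (toy["transactions_per_user"] > 1).mean()
print(toy)
print(repeat_share)
```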


### Data Transformation

In [9]:
target = 'class'

num_features = [
    'purchase_value',
    'age',
    'hour_of_day',
    'day_of_week',
    'time_since_signup',
    'transactions_per_user'
]

cat_features = [
    'source',
    'browser',
    'sex',
    'country'
]

X = fraud_df[num_features + cat_features]
y = fraud_df[target]


### Preprocessing Pipeline

In [10]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)
    ]
)


### Train/test split

In [11]:
# train_test_split and Counter are already imported in the first cell.
# Reuse the modeling columns so X matches what the preprocessor expects.
X = fraud_df[num_features + cat_features]
y = fraud_df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

print("Class distribution BEFORE resampling (Fraud_Data):")
print(Counter(y_train))


Class distribution BEFORE resampling (Fraud_Data):
Counter({0: 109568, 1: 11321})


In [15]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(
    X_train_processed, y_train
)


In [16]:


print("Class distribution AFTER SMOTE (Fraud_Data):")
print(Counter(y_train_smote))


Class distribution AFTER SMOTE (Fraud_Data):
Counter({0: 109568, 1: 109568})
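
The new minority rows are synthetic. SMOTE creates each one by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step in plain Python. It is a simplified illustration, not imbalanced-learn's implementation; it skips the nearest-neighbor search, and the function name `smote_like_sample` is hypothetical.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]  # assumed to be a nearest minority-class neighbor

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Since every synthetic row is derived from existing rows, resampling before the split would let information from test rows leak into training. That is why only `X_train_processed` is passed to `fit_resample`.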


In [17]:
# Baseline classifier trained on the SMOTE-balanced training set.
# max_iter is raised above the default of 100 to give the solver room to converge.
smote_model = LogisticRegression(max_iter=1000)
smote_model.fit(X_train_smote, y_train_smote)


LogisticRegression(max_iter=1000)


In [18]:
# predict() applies the default 0.5 probability threshold.
y_pred_smote = smote_model.predict(X_test_processed)

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote))


[[17785  9608]
 [  874  1956]]
              precision    recall  f1-score   support

           0       0.95      0.65      0.77     27393
           1       0.17      0.69      0.27      2830

    accuracy                           0.65     30223
   macro avg       0.56      0.67      0.52     30223
weighted avg       0.88      0.65      0.73     30223
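
The fraud-class figures in the report can be reproduced directly from the confusion matrix, which makes the trade-off easier to read. The counts below are copied from the output above; the variable names are illustrative.

```python
# Confusion matrix from the output: rows = actual class, columns = predicted class.
tn, fp = 17785, 9608   # actual 0: correctly cleared, falsely flagged
fn, tp = 874, 1956     # actual 1: missed fraud, caught fraud

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud that gets flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
false_alarms_per_hit = fp / tp

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
print(f"false alarms per caught fraud: {false_alarms_per_hit:.1f}")
```

At the default threshold the model catches about 69% of fraud, but only about 17% of its alerts are real fraud, which is roughly five false alarms for every fraud case caught.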

