# 🧪 Walmart Fraud Detection System - Prototype
This notebook demonstrates how to detect fraudulent transactions with a machine-learning classifier (a Random Forest) trained on a simulated retail dataset.

In [None]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

In [None]:
# Step 2: Create simulated dataset
np.random.seed(42)
n_samples = 10000
data = pd.DataFrame({
    'TransactionID': np.arange(n_samples),
    'UserID': np.random.randint(1000, 2000, size=n_samples),
    'Amount': np.random.exponential(scale=100, size=n_samples),
    'DeviceID': np.random.choice(['mobile', 'desktop', 'tablet'], size=n_samples),
    'ReturnCount': np.random.poisson(1, size=n_samples),
    'CouponUsed': np.random.randint(0, 2, size=n_samples),
    'IsFraud': np.random.choice([0, 1], size=n_samples, p=[0.98, 0.02])
})

In [None]:
# Step 3: Preprocessing
le = LabelEncoder()
data['DeviceID'] = le.fit_transform(data['DeviceID'])
X = data.drop(['TransactionID', 'UserID', 'IsFraud'], axis=1)
y = data['IsFraud']

# Balance the dataset using SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

In [None]:
# Step 4: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42)
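
Steps 3 and 4 resample first and split afterwards, so synthetic minority points built from rows that later land in the test set can end up in the training set, which tends to inflate the test scores. A minimal sketch of the safer ordering follows: split first (stratified), then oversample the training portion only. To keep the example light it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; the toy data and all variable names here are assumptions, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (assumption): two features and an imbalanced label.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=100, size=n),
    "ReturnCount": rng.poisson(1, size=n),
    "IsFraud": rng.choice([0, 1], size=n, p=[0.95, 0.05]),
})
X_toy = toy.drop(columns="IsFraud")
y_toy = toy["IsFraud"]

# 1) Split first; stratify keeps the fraud rate similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# 2) Oversample the minority class in the training part only.
#    With imbalanced-learn this step would instead be:
#    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
train = X_tr.assign(IsFraud=y_tr)
majority = train[train["IsFraud"] == 0]
minority = train[train["IsFraud"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_up])
X_bal = balanced.drop(columns="IsFraud")
y_bal = balanced["IsFraud"]

# The test set is left untouched, so it keeps the original class balance.
print("train class counts:", y_bal.value_counts().to_dict())
print("test fraud rate:", round(y_te.mean(), 3))
```

Evaluating on the untouched `X_te`, `y_te` then reflects the class balance the model would face in practice.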

In [None]:
# Step 5: Train Random Forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

In [None]:
# Step 6: Evaluate the model
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('ROC AUC Score:', roc_auc_score(y_test, y_pred))

[[2406  529]
 [ 420 2523]]
              precision    recall  f1-score   support

           0       0.85      0.82      0.84      2935
           1       0.83      0.86      0.84      2943

    accuracy                           0.84      5878
   macro avg       0.84      0.84      0.84      5878
weighted avg       0.84      0.84      0.84      5878

ROC AUC Score: 0.8385249901449519
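
`roc_auc_score` is designed for scores or probabilities. When it receives hard 0/1 predictions, as in Step 6, it reduces to the mean of the two per-class recalls, which is why the value above matches (0.82 + 0.86) / 2 ≈ 0.84. The sketch below compares both variants using `predict_proba`; the synthetic dataset from `make_classification` and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

hard = model.predict(X_te)               # 0/1 labels at the 0.5 threshold
proba = model.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard)
auc_from_proba = roc_auc_score(y_te, proba)

# With hard labels, AUC equals the average of the per-class recalls.
mean_recall = (
    recall_score(y_te, hard, pos_label=0) + recall_score(y_te, hard, pos_label=1)
) / 2

print("AUC from hard labels:  ", auc_from_labels)
print("Mean per-class recall: ", mean_recall)
print("AUC from probabilities:", auc_from_proba)
```

The probability-based value uses the full ranking of the test rows and is the one usually reported as ROC AUC.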


In [None]:
# Step 7: Simulate new data for testing
np.random.seed(100)  # Use a different seed for new data
n_new_samples = 100
new_data = pd.DataFrame({
    'TransactionID': np.arange(n_samples, n_samples + n_new_samples),
    'UserID': np.random.randint(1000, 2000, size=n_new_samples),
    'Amount': np.random.exponential(scale=100, size=n_new_samples),
    'DeviceID': np.random.choice(['mobile', 'desktop', 'tablet'], size=n_new_samples),
    'ReturnCount': np.random.poisson(1, size=n_new_samples),
    'CouponUsed': np.random.randint(0, 2, size=n_new_samples),
    'IsFraud': np.random.choice([0, 1], size=n_new_samples, p=[0.95, 0.05]) # Slightly higher fraud rate for simulation
})

# Preprocess the new data
new_data['DeviceID'] = le.transform(new_data['DeviceID'])
X_new = new_data.drop(['TransactionID', 'UserID', 'IsFraud'], axis=1)
y_new = new_data['IsFraud']

In [None]:
# Step 8: Predict on new data
y_new_pred = clf.predict(X_new)

# Evaluate the model on new data
print("Evaluation on New Simulated Data:")
print(confusion_matrix(y_new, y_new_pred))
print(classification_report(y_new, y_new_pred))
print('ROC AUC Score:', roc_auc_score(y_new, y_new_pred))

Evaluation on New Simulated Data:
[[72 22]
 [ 6  0]]
              precision    recall  f1-score   support

           0       0.92      0.77      0.84        94
           1       0.00      0.00      0.00         6

    accuracy                           0.72       100
   macro avg       0.46      0.38      0.42       100
weighted avg       0.87      0.72      0.79       100

ROC AUC Score: 0.3829787234042553
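
The gap between the two evaluations is consistent with how the data was simulated: `IsFraud` is drawn independently of every feature, so there is no signal for a model to learn. One way to check this is a stratified cross-validation on un-resampled data, which should land near a ROC AUC of 0.5. The sketch below regenerates a smaller dataset in the same spirit; the sample size, the 5% fraud rate, the integer-coded `DeviceID`, and the fold count are assumptions chosen to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
n = 3000
sim = pd.DataFrame({
    "Amount": rng.exponential(scale=100, size=n),
    "DeviceID": rng.integers(0, 3, size=n),   # already integer-coded
    "ReturnCount": rng.poisson(1, size=n),
    "CouponUsed": rng.integers(0, 2, size=n),
})
# Labels are sampled independently of the features, as in Step 2.
labels = rng.choice([0, 1], size=n, p=[0.95, 0.05])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    sim,
    labels,
    cv=cv,
    scoring="roc_auc",  # uses predicted probabilities, not hard labels
)
print("per-fold AUC:", np.round(scores, 3))
print("mean AUC:", round(scores.mean(), 3))
```

A mean close to 0.5 would suggest that the 0.84 reported in Step 6 comes from the resampling procedure rather than from learnable structure in the data.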


# Task
Create a Streamlit application that loads the previously trained RandomForestClassifier model and LabelEncoder, allows users to input transaction details (Amount, DeviceID, ReturnCount, CouponUsed), predicts whether the transaction is fraudulent using the loaded model, and displays the prediction result. Provide instructions on how to run the Streamlit application.

## Save the trained model and preprocessing objects

### Subtask:
Save the trained `RandomForestClassifier` model and the `LabelEncoder` object to files so they can be loaded into the Streamlit application.


**Reasoning**:
Save the trained model and the label encoder to files using joblib.



In [None]:
import joblib

# Save the trained model
joblib.dump(clf, 'random_forest_model.joblib')

# Save the fitted LabelEncoder
joblib.dump(le, 'label_encoder.joblib')

['label_encoder.joblib']
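
Before building the app, it can be worth checking that the saved artifacts reload and reproduce the in-memory predictions. A minimal round-trip sketch, assuming a small stand-in model and a temporary directory rather than the notebook's actual `clf`, `le`, and files:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Stand-in objects (assumption): a tiny model and a fitted encoder.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 4))
y_small = rng.integers(0, 2, size=200)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_small, y_small)
encoder = LabelEncoder().fit(["mobile", "desktop", "tablet"])

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "random_forest_model.joblib")
    encoder_path = os.path.join(tmp, "label_encoder.joblib")
    joblib.dump(model, model_path)
    joblib.dump(encoder, encoder_path)
    model_loaded = joblib.load(model_path)
    encoder_loaded = joblib.load(encoder_path)

# The reloaded objects should behave exactly like the originals.
same_predictions = np.array_equal(
    model.predict(X_small), model_loaded.predict(X_small)
)
same_classes = list(encoder_loaded.classes_) == list(encoder.classes_)
print(same_predictions, same_classes)
```

These files are pickle-based, so they are best loaded with the same scikit-learn version that wrote them and only from a trusted source.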

## Create a Streamlit application file

### Subtask:
Create a Python file (e.g., `app.py`) that will contain the Streamlit code.


**Reasoning**:
Create an empty file named `app.py` in the current working directory.



In [None]:
# This is an empty file for the Streamlit application.
# The content will be added in the next step.

## Load the model and preprocessing objects in the Streamlit app

### Subtask:
Load the saved model and `LabelEncoder` into the Streamlit application.


**Reasoning**:
Load the saved model and LabelEncoder into the Streamlit application using joblib.



In [None]:
import streamlit as st
import joblib

# Load the trained model
clf = joblib.load('random_forest_model.joblib')

# Load the fitted LabelEncoder
le = joblib.load('label_encoder.joblib')

st.title('Walmart Fraud Detection System')



DeltaGenerator()

**Reasoning**:
The previous command failed because the `streamlit` library is not installed. Install `streamlit` using pip.



In [None]:
!pip install streamlit joblib

Collecting streamlit
  Downloading streamlit-1.45.1-py3-none-any.whl.metadata (8.9 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.45.1-py3-none-any.whl (9.9 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl (79 kB)

**Reasoning**:
Now that `streamlit` and `joblib` are installed, attempt to load the saved model and LabelEncoder again.



In [None]:
import streamlit as st
import joblib

# Load the trained model
clf = joblib.load('random_forest_model.joblib')

# Load the fitted LabelEncoder
le = joblib.load('label_encoder.joblib')

st.title('Walmart Fraud Detection System')



DeltaGenerator()

## Design the Streamlit interface

### Subtask:
Create input fields in the Streamlit app for users to enter transaction details (Amount, DeviceID, ReturnCount, CouponUsed).


**Reasoning**:
Add input fields for the transaction details in the Streamlit app.



In [None]:
st.header('Enter Transaction Details:')

amount = st.number_input('Amount', min_value=0.0, value=100.0)
device_id = st.selectbox('Device Type', ['mobile', 'desktop', 'tablet'])
return_count = st.number_input('Return Count', min_value=0, value=0, step=1)
coupon_used = st.radio('Coupon Used', [0, 1], format_func=lambda x: 'Yes' if x == 1 else 'No')



## Implement prediction logic in the Streamlit app

### Subtask:
Write code to take the user's input, preprocess it using the loaded `LabelEncoder`, and use the loaded model to predict whether the transaction is fraudulent.


**Reasoning**:
Write the code to create a button, collect user input, preprocess it, make a prediction using the loaded model, and display the result.



In [None]:
if st.button('Predict Fraud'):
    # Create a DataFrame from user input
    input_data = pd.DataFrame({
        'Amount': [amount],
        'DeviceID': [device_id],
        'ReturnCount': [return_count],
        'CouponUsed': [coupon_used]
    })

    # Preprocess the 'DeviceID'
    input_data['DeviceID'] = le.transform(input_data['DeviceID'])

    # Make prediction
    prediction_proba = clf.predict_proba(input_data)
    fraud_probability = prediction_proba[0][1] # Probability of class 1 (fraud)

    # Determine the predicted class
    predicted_class = 1 if fraud_probability >= 0.5 else 0

    st.subheader('Prediction Result:')
    if predicted_class == 1:
        st.error(f'Fraudulent Transaction Predicted! Probability: {fraud_probability:.2f}')
    else:
        st.success(f'Transaction is Likely Not Fraudulent. Probability: {fraud_probability:.2f}')
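
Two details of this step are easy to get wrong: the input columns should appear in the same order the model was trained on (`Amount`, `DeviceID`, `ReturnCount`, `CouponUsed`), and `LabelEncoder.transform` raises a `ValueError` for a device string it has not seen. A minimal sketch of a helper that handles both; `build_features` and `FEATURE_ORDER` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Column order of X in Step 3 after dropping the ID and label columns.
FEATURE_ORDER = ["Amount", "DeviceID", "ReturnCount", "CouponUsed"]


def build_features(amount, device, return_count, coupon_used, encoder):
    """Return a one-row DataFrame shaped like the training features."""
    if device not in encoder.classes_:
        raise ValueError(f"Unknown device type: {device!r}")
    row = {
        "Amount": float(amount),
        "DeviceID": int(encoder.transform([device])[0]),
        "ReturnCount": int(return_count),
        "CouponUsed": int(coupon_used),
    }
    # Selecting by FEATURE_ORDER fixes the column order explicitly.
    return pd.DataFrame([row])[FEATURE_ORDER]


# Stand-in encoder (assumption); the app would pass the loaded `le` instead.
demo_encoder = LabelEncoder().fit(["mobile", "desktop", "tablet"])
features = build_features(250.0, "tablet", 2, 1, demo_encoder)
print(features)
```

In the app, the returned frame could be passed to `clf.predict_proba` in place of `input_data`.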




## Add instructions to run the Streamlit app

### Subtask:
Add instructions on how to save the Streamlit app file and run it from the terminal.


**Reasoning**:
Write the complete Streamlit application to `app.py` with the `%%writefile` cell magic so that it can be run from the terminal.



In [None]:
%%writefile app.py
import streamlit as st
import joblib
import pandas as pd

# Load the trained model
clf = joblib.load('random_forest_model.joblib')

# Load the fitted LabelEncoder
le = joblib.load('label_encoder.joblib')

st.title('Walmart Fraud Detection System')

st.header('Enter Transaction Details:')

amount = st.number_input('Amount', min_value=0.0, value=100.0)
device_id = st.selectbox('Device Type', ['mobile', 'desktop', 'tablet'])
return_count = st.number_input('Return Count', min_value=0, value=0, step=1)
coupon_used = st.radio('Coupon Used', [0, 1], format_func=lambda x: 'Yes' if x == 1 else 'No')

if st.button('Predict Fraud'):
    # Create a DataFrame from user input
    input_data = pd.DataFrame({
        'Amount': [amount],
        'DeviceID': [device_id],
        'ReturnCount': [return_count],
        'CouponUsed': [coupon_used]
    })

    # Preprocess the 'DeviceID'
    input_data['DeviceID'] = le.transform(input_data['DeviceID'])

    # Make prediction
    prediction_proba = clf.predict_proba(input_data)
    fraud_probability = prediction_proba[0][1] # Probability of class 1 (fraud)

    # Determine the predicted class
    predicted_class = 1 if fraud_probability >= 0.5 else 0

    st.subheader('Prediction Result:')
    if predicted_class == 1:
        st.error(f'Fraudulent Transaction Predicted! Probability: {fraud_probability:.2f}')
    else:
        st.success(f'Transaction is Likely Not Fraudulent. Probability: {fraud_probability:.2f}')

Overwriting app.py


**Reasoning**:
Add a markdown cell to provide instructions on how to run the Streamlit application.



In [None]:
%%markdown
## How to Run the Streamlit Application

1.  **Save the code:** Ensure the Python code for the Streamlit application is saved in a file named `app.py`. If you used the `%%writefile app.py` command above, this step is already done.
2.  **Open a terminal or command prompt:** Navigate to the directory where you saved the `app.py` file.
3.  **Run the application:** Execute the following command in the terminal:

    ```bash
    streamlit run app.py
    ```

Streamlit starts a local server (by default at `http://localhost:8501`) and opens the application in a new browser tab.


## Summary:

### Data Analysis Key Findings

*   The trained `RandomForestClassifier` model and the fitted `LabelEncoder` were successfully saved to `random_forest_model.joblib` and `label_encoder.joblib` files using `joblib`.
*   A Streamlit application file (`app.py`) was created to host the fraud detection system.
*   The saved model and LabelEncoder were successfully loaded into the Streamlit application.
*   The Streamlit interface was designed with input fields for Amount, Device ID, Return Count, and Coupon Used.
*   The application includes logic to preprocess user input, predict the probability of fraud using the loaded model, and display the prediction result and probability.
*   Instructions were provided on how to save the `app.py` file and run the Streamlit application from the terminal using `streamlit run app.py`.

### Insights or Next Steps

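*   The Streamlit application provides a simple interface for scoring individual transactions with the trained model. The model itself is not yet reliable: on the 100 new simulated transactions it caught none of the 6 fraud cases (recall 0.00, ROC AUC 0.38), which is expected because `IsFraud` is sampled independently of the features.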
*   Consider adding input validation to ensure the user provides appropriate values for each field in the Streamlit app.
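
The input validation suggested above could start as a plain function that the app calls before predicting. A minimal sketch; the function name, the allowed device list, and the specific rules are assumptions for illustration.

```python
import math

ALLOWED_DEVICES = ("mobile", "desktop", "tablet")  # same options as the selectbox


def validate_transaction(amount, device, return_count, coupon_used):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []

    # bool is a subclass of int, so reject it explicitly for numeric fields.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("Amount must be a number.")
    elif not math.isfinite(amount) or amount < 0:
        problems.append("Amount must be finite and non-negative.")

    if device not in ALLOWED_DEVICES:
        problems.append(f"Device must be one of {ALLOWED_DEVICES}.")

    if (
        isinstance(return_count, bool)
        or not isinstance(return_count, int)
        or return_count < 0
    ):
        problems.append("Return count must be a non-negative integer.")

    if coupon_used not in (0, 1):
        problems.append("Coupon used must be 0 or 1.")

    return problems


print(validate_transaction(100.0, "mobile", 0, 1))
print(validate_transaction(-5, "kiosk", 1.5, 2))
```

Inside the `if st.button('Predict Fraud'):` branch, each returned message could be shown with `st.error` and the prediction skipped when the list is non-empty.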
