# Enhancing Fraud Detection with Machine Learning on AWS

This project shows how machine learning and cloud services can be combined to detect fraud in financial transaction data. Amazon Web Services (AWS) is used for storage and computation, and a pre-trained LightGBM model, serialized to a pickle file, scores each transaction for its likelihood of being fraudulent. The implementation retrieves the data directly from AWS S3 with Boto3, loads it into a pandas DataFrame, and applies the model to it.

Detailed Explanation:
Cloud Integration and Data Retrieval:
The workflow begins by retrieving data from AWS S3, a scalable cloud object storage service. Using the Boto3 library, the project connects to an S3 bucket, fetches transaction data stored in CSV format, and parses it into a pandas DataFrame. Because the object is read straight into memory, no local copy of the file needs to be kept on disk.
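
The retrieval step can be sketched as a small helper. This is a minimal sketch rather than the project's actual code: `load_csv_from_s3`, its `client` parameter, and `FakeS3Client` are hypothetical names, introduced so the function can be exercised with a stand-in client instead of real AWS credentials.

```python
import io

import pandas as pd


def load_csv_from_s3(bucket, key, client=None):
    """Fetch a CSV object from S3 and parse it into a DataFrame.

    `client` is a hypothetical injection point: any object exposing
    boto3's `get_object(Bucket=..., Key=...)` call will do. When it is
    omitted, a real boto3 S3 client is created, which needs credentials.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tried offline
        client = boto3.client("s3")
    obj = client.get_object(Bucket=bucket, Key=key)
    # obj["Body"] is a file-like stream; read() returns the raw bytes.
    raw = obj["Body"].read()
    return pd.read_csv(io.BytesIO(raw))


class FakeS3Client:
    """Stand-in for an S3 client, used only for this illustration."""

    def get_object(self, Bucket, Key):
        body = b"TransactionID_x,TransactionAmt\n1,31.95\n2,49.00\n"
        return {"Body": io.BytesIO(body)}


demo = load_csv_from_s3("example-bucket", "example.csv", client=FakeS3Client())
print(demo.shape)
```

With a real bucket, the same function would be called without the `client` argument, assuming credentials are configured in the environment.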

Machine Learning Model Utilization:
A pre-trained LightGBM model, stored as a pickle file, is loaded into the environment. LightGBM is a gradient boosting framework known for its training speed and memory efficiency on large datasets with many features, which makes it a good fit for fraud detection tasks. Once loaded, the model is applied to the new transaction data to predict the likelihood of fraud for each transaction.
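
How scores are obtained depends on which kind of object was pickled. The scores printed later in this notebook are fractional values between 0 and 1, which is what a LightGBM `Booster` with a binary objective returns from `predict`; the scikit-learn wrapper `LGBMClassifier` would instead return class labels from `predict` and probabilities from `predict_proba`. The sketch below handles both cases. `fraud_scores` and the two stand-in classes are hypothetical names used only for illustration.

```python
import numpy as np


def fraud_scores(model, features):
    """Return one fraud probability per row.

    Assumption: the model is either a scikit-learn style classifier
    (it has `predict_proba`) or a Booster-like object whose `predict`
    already returns probabilities.
    """
    if hasattr(model, "predict_proba"):
        # Column 1 holds the probability of the positive (fraud) class.
        return np.asarray(model.predict_proba(features))[:, 1]
    return np.asarray(model.predict(features))


class BoosterLike:
    """Stand-in that mimics a Booster returning probabilities."""

    def predict(self, features):
        return [0.1, 0.9]


class ClassifierLike:
    """Stand-in that mimics a scikit-learn style classifier."""

    def predict_proba(self, features):
        return [[0.9, 0.1], [0.1, 0.9]]


scores_a = fraud_scores(BoosterLike(), None)
scores_b = fraud_scores(ClassifierLike(), None)
print(scores_a, scores_b)
```

Pickle files can execute arbitrary code when loaded, so a model file should only be unpickled if it comes from a trusted source.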

Predictive Analytics:
The predictions generated by the model are appended to the original dataset as a new column, `predictions`. Transactions whose score exceeds a predefined threshold (0.5 in this notebook) are flagged as potentially fraudulent and isolated for further analysis or manual review.
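
The flagging step can be illustrated on a tiny example. The scores below are made up for illustration, and `is_high_risk` and `FRAUD_THRESHOLD` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

FRAUD_THRESHOLD = 0.5  # cutoff taken from the example in the text

# Hypothetical scores, for illustration only.
scored = pd.DataFrame({
    "TransactionID_x": [101, 102, 103, 104],
    "predictions": [0.02, 0.73, 0.50, 0.91],
})

# Strictly greater than the threshold, so a score of exactly 0.5 is not flagged.
scored["is_high_risk"] = scored["predictions"] > FRAUD_THRESHOLD

# Keep only the flagged rows for manual review.
high_risk = scored.loc[scored["is_high_risk"], ["TransactionID_x", "predictions"]]
print(high_risk)
```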

Output and Visualization:
The enriched dataset, which now contains both the original transaction details and the fraud scores, can be used to build reports or visualizations of how fraud is distributed across transactions. High-risk transactions can also be extracted and listed, giving security teams a concrete list to act on.
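
As one possible shape for such a report, the sketch below counts flagged transactions per hour of day and builds a review list ordered by score. The data is made up; the column names `hour`, `predictions`, and `TransactionID_x` follow the notebook, while `flagged_per_hour` and `review_queue` are hypothetical names.

```python
import pandas as pd

# Hypothetical scored transactions; column names follow the notebook.
scored = pd.DataFrame({
    "TransactionID_x": [1, 2, 3, 4, 5, 6],
    "hour": [0, 0, 1, 1, 1, 2],
    "predictions": [0.10, 0.80, 0.60, 0.70, 0.20, 0.05],
})
flagged = scored[scored["predictions"] > 0.5]

# Number of flagged transactions for each hour of day.
flagged_per_hour = flagged.groupby("hour").size()

# Review list with the highest scores first.
review_queue = flagged.sort_values("predictions", ascending=False)

print(flagged_per_hour)
print(review_queue[["TransactionID_x", "predictions"]])
```

A Series like `flagged_per_hour` can be passed to a plotting library if a chart is preferred over a table.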


This project illustrates a practical application of machine learning to fraud detection, supported by cloud computing. With AWS handling data storage and management and a gradient boosting model such as LightGBM handling the scoring, the system can flag likely fraud for review, which supports prevention as well as detection in a dynamic financial environment. Keeping the data in the cloud also makes the pipeline scalable and accessible, which matters as fraud techniques evolve.

In [8]:
import aws_s3  # user-defined helper module (not used in the cells shown here)
import pandas as pd
import pickle
import boto3
from io import StringIO

In [3]:
# Load the trained model from the pickle file
with open('clf.pkl', 'rb') as file:
    lgb_model = pickle.load(file)

In [14]:
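bucket_name = ''  # name of the S3 bucket (left blank here)
file_key = ''     # key (path) of the CSV object inside the bucket (left blank here)

# Fetch the object and decode its bytes as UTF-8 text
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket_name, Key=file_key)
data = obj['Body'].read().decode('utf-8')

# Parse the CSV text into a DataFrame
df = pd.read_csv(StringIO(data))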

In [16]:
# Drop the leftover 'Unnamed: 0' column (most likely an index saved into the CSV)
df = df.drop(columns=['Unnamed: 0'])
df

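| | TransactionID_x | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | addr1 | ... | V316 | V317 | V318 | V319 | V320 | V321 | hour | day | dow | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3663549 | 31.95 | 4 | 10409 | 111.0 | 150.0 | 4 | 226.0 | 2 | 170.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 |
| 1 | 3663550 | 49.00 | 4 | 4272 | 111.0 | 150.0 | 4 | 226.0 | 2 | 299.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 |
| 2 | 3663551 | 171.00 | 4 | 4476 | 574.0 | 150.0 | 4 | 226.0 | 2 | 472.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 263.0 | 0.0 | 0 | 2 | 0 | 7 |
| 3 | 3663552 | 284.95 | 4 | 10989 | 360.0 | 150.0 | 4 | 166.0 | 2 | 205.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 |
| 4 | 3663553 | 67.95 | 4 | 18018 | 452.0 | 150.0 | 2 | 117.0 | 2 | 264.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 3673544 | 92.00 | 4 | 16659 | 170.0 | 150.0 | 4 | 226.0 | 1 | 204.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 |
| 9996 | 3673545 | 59.00 | 4 | 12501 | 490.0 | 150.0 | 4 | 226.0 | 2 | 272.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 |
| 9997 | 3673546 | 103.95 | 4 | 7585 | 553.0 | 150.0 | 4 | 226.0 | 1 | 204.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 |
| 9998 | 3673547 | 39.00 | 4 | 1632 | 350.0 | 150.0 | 2 | 224.0 | 2 | 231.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 |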
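
An `Unnamed: 0` column usually appears when a DataFrame is written with `to_csv()` and its index is saved as the first, unnamed column. Assuming that is what happened here, the column can be absorbed at read time with `index_col=0` instead of being dropped afterwards. The CSV text below is a hypothetical two-row excerpt built from the values shown above.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text as written by DataFrame.to_csv() with its default
# index=True: the first header cell is empty.
csv_text = ",TransactionID_x,TransactionAmt\n0,3663549,31.95\n1,3663550,49.0\n"

# Default read: the saved index shows up as a column named 'Unnamed: 0'.
as_column = pd.read_csv(StringIO(csv_text))

# With index_col=0 the first column becomes the index again.
as_index = pd.read_csv(StringIO(csv_text), index_col=0)

print(list(as_column.columns))
print(list(as_index.columns))
```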


In [17]:
# Score every transaction with the loaded model
predictions = lgb_model.predict(df)

# Inspect the raw scores
print(predictions)

[0.00737636 0.00209312 0.00188366 ... 0.00391351 0.00050201 0.00017588]
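
Scoring only gives meaningful results when the DataFrame has the same feature columns the model was trained on. The sketch below checks this before calling `predict`. `align_features` is a hypothetical helper; the list of expected names would have to come from the model itself, for example from `Booster.feature_name()` or the `feature_name_` attribute of the scikit-learn wrapper, depending on which object was pickled.

```python
import pandas as pd


def align_features(df, expected_columns):
    """Return df restricted to, and ordered by, expected_columns.

    Raises ValueError listing any expected columns that df lacks.
    """
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    return df[list(expected_columns)]


# Hypothetical frame with an extra column and a different column order.
frame = pd.DataFrame({"card1": [10409], "extra": [1], "TransactionAmt": [31.95]})
aligned = align_features(frame, ["TransactionAmt", "card1"])
print(list(aligned.columns))
```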


In [18]:
# Add the array as a new column to the DataFrame
df['predictions'] = predictions
df


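| | TransactionID_x | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | addr1 | ... | V317 | V318 | V319 | V320 | V321 | hour | day | dow | month | predictions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3663549 | 31.95 | 4 | 10409 | 111.0 | 150.0 | 4 | 226.0 | 2 | 170.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 | 0.007376 |
| 1 | 3663550 | 49.00 | 4 | 4272 | 111.0 | 150.0 | 4 | 226.0 | 2 | 299.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 | 0.002093 |
| 2 | 3663551 | 171.00 | 4 | 4476 | 574.0 | 150.0 | 4 | 226.0 | 2 | 472.0 | ... | 0.0 | 0.0 | 0.0 | 263.0 | 0.0 | 0 | 2 | 0 | 7 | 0.001884 |
| 3 | 3663552 | 284.95 | 4 | 10989 | 360.0 | 150.0 | 4 | 166.0 | 2 | 205.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 | 0.000505 |
| 4 | 3663553 | 67.95 | 4 | 18018 | 452.0 | 150.0 | 2 | 117.0 | 2 | 264.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 7 | 0.004822 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 3673544 | 92.00 | 4 | 16659 | 170.0 | 150.0 | 4 | 226.0 | 1 | 204.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 | 0.009207 |
| 9996 | 3673545 | 59.00 | 4 | 12501 | 490.0 | 150.0 | 4 | 226.0 | 2 | 272.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 | 0.004385 |
| 9997 | 3673546 | 103.95 | 4 | 7585 | 553.0 | 150.0 | 4 | 226.0 | 1 | 204.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 | 0.003914 |
| 9998 | 3673547 | 39.00 | 4 | 1632 | 350.0 | 150.0 | 2 | 224.0 | 2 | 231.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16 | 5 | 3 | 7 | 0.000502 |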


In [19]:
# Count the transactions whose fraud score exceeds the 0.5 threshold
count = (df['predictions'] > 0.5).sum()

print(f"Number of rows with 'predictions' value > 0.5: {count}")


Number of rows with 'predictions' value > 0.5: 130
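
The number of flagged rows depends directly on the cutoff, so it can be useful to compare a few candidate thresholds before settling on one. The sketch below does this on made-up scores; the values and the name `flag_counts` are hypothetical and are not results from this dataset.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([0.01, 0.35, 0.55, 0.62, 0.97], name="predictions")

# Count how many rows would be flagged at each candidate cutoff.
flag_counts = {t: int((scores > t).sum()) for t in (0.3, 0.5, 0.9)}
print(flag_counts)
```

A lower cutoff sends more transactions to manual review, while a higher one risks missing fraud, so the choice is a trade-off between review workload and coverage.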


In [20]:
# Select the rows whose fraud score exceeds the 0.5 threshold
high_risk_rows = df[df['predictions'] > 0.5]

# Get the 'TransactionID_x' values of the high-risk rows
high_risk_transaction_ids = high_risk_rows['TransactionID_x']

print("TransactionID_x values with high risk of fraud:")
print(high_risk_transaction_ids)


TransactionID_x values with high risk of fraud:
220     3663769
299     3663848
389     3663938
393     3663942
400     3663949
         ...   
9130    3672679
9168    3672717
9169    3672718
9368    3672917
9526    3673075
Name: TransactionID_x, Length: 130, dtype: int64


In [24]:
# Select the high-risk rows again and display their IDs as a DataFrame
high_risk_rows = df[df['predictions'] > 0.5]

high_risk_rows[['TransactionID_x']]

| | TransactionID_x |
|---|---|
| 220 | 3663769 |
| 299 | 3663848 |
| 389 | 3663938 |
| 393 | 3663942 |
| 400 | 3663949 |
| ... | ... |
| 9130 | 3672679 |
| 9168 | 3672717 |
| 9169 | 3672718 |
| 9368 | 3672917 |
