# Key Steps to Successfully Complete the Credit Card Fraud Detection Project



1. **Define the Problem and Objectives**
   - Clearly state the problem: Detect fraudulent transactions.
   - Objective: Minimize false positives while maintaining high detection rates.

2. **Data Collection and Understanding**
   - Gather the dataset (here, `CreditCardData.csv`).
   - Understand the features and labels.
   - Assess the quality and structure of the data.

In [97]:
import pandas as pd
import numpy as np

df = pd.read_csv('CreditCardData.csv')
df.head()

Unnamed: 0,Transaction ID,Date,Day of Week,Time,Type of Card,Entry Mode,Amount,Type of Transaction,Merchant Group,Country of Transaction,Shipping Address,Country of Residence,Gender,Age,Bank,Fraud
0,#3577 209,14-Oct-20,Wednesday,19,Visa,Tap,£5,POS,Entertainment,United Kingdom,United Kingdom,United Kingdom,M,25.2,RBS,0
1,#3039 221,14-Oct-20,Wednesday,17,MasterCard,PIN,£288,POS,Services,USA,USA,USA,F,49.6,Lloyds,0
2,#2694 780,14-Oct-20,Wednesday,14,Visa,Tap,£5,POS,Restaurant,India,India,India,F,42.2,Barclays,0
3,#2640 960,13-Oct-20,Tuesday,14,Visa,Tap,£28,POS,Entertainment,United Kingdom,India,United Kingdom,F,51.0,Barclays,0
4,#2771 031,13-Oct-20,Tuesday,23,Visa,CVC,£91,Online,Electronics,USA,USA,United Kingdom,M,38.0,Halifax,1


In [40]:
for col in df:
    print(col, ': ', df[col].unique()) 

Transaction ID :  ['#3577 209' '#3039 221' '#2694 780' ... '#3304 849' '#3532 129'
 '#3107 092']
Date :  ['14-Oct-20' '13-Oct-20' '16-Oct-20' '15-Oct-20']
Day of Week :  ['Wednesday' 'Tuesday' 'Thursday' 'Friday']
Time :  [19 17 14 23 20 18 11  1 21  0  8  9 15 22  7  3 12  4  5 10 13  2 16  6
 24]
Type of Card :  ['Visa' 'MasterCard']
Entry Mode :  ['Tap' 'PIN' 'CVC']
Amount :  ['£5' '£288' '£28' '£91' '£30' '£231' '£154' '£39' '£17' '£326' '£106'
 '£21' '£211' '£98' '£25' '£242' '£22' '£29' '£397' '£38' '£155' '£12'
 '£9' '£381' '£94' '£16' '£349' '£49' '£270' '£13' '£11' '£314' '£320'
 '£6' '£206' '£57' '£59' '£26' '£153' '£27' '£20' '£123' '£285' '£24'
 '£23' '£226' '£296' '£353' '£60' '£134' '£80' '£241' '£86' '£309' '£161'
 '£261' '£264' '£10' '£53' '£119' '£19' '£330' '£14' '£110' '£18' '£173'
 '£115' '£189' '£219' '£268' '£341' '£92' '£7' '£145' '£382' '£358' '£318'
 '£93' '£66' '£130' '£325' '£111' '£380' '£347' '£233' '£143' '£129' '£70'
 '£79' '£62' '£375' '£52' '£180' '£95'

In [106]:
df['Fraud'].sum()

7195

3. **Data Preprocessing**
   - **Cleaning**: Handle missing values, remove duplicates, and correct errors.
   - **Feature Engineering**: Create new features that might help the model (e.g., transaction frequency, average transaction amount).
   - **Encoding Categorical Variables**: Convert categorical variables into numerical ones using techniques like one-hot encoding.
   - **Normalization/Standardization**: Scale the features to ensure they contribute equally to the model.
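The encoding and scaling bullets above can be combined into a single preprocessing object. The following is a minimal sketch, assuming scikit-learn is available. The `toy` frame is made-up data that only borrows a few column names from the dataset, and the split into `categorical_cols` and `numeric_cols` is an illustrative choice, not the notebook's actual feature set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows that mimic a few columns of the dataset (values are made up).
toy = pd.DataFrame({
    "Type of Card": ["Visa", "MasterCard", "Visa", "Visa"],
    "Entry Mode": ["Tap", "PIN", "CVC", "Tap"],
    "Amount": [5, 288, 91, 28],
    "Age": [25.2, 49.6, 38.0, 51.0],
})

categorical_cols = ["Type of Card", "Entry Mode"]  # assumed categorical features
numeric_cols = ["Amount", "Age"]                   # assumed numeric features

preprocessor = ColumnTransformer(
    transformers=[
        # One-hot encode categories; ignore categories unseen during fit.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Standardize numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocessor.fit_transform(toy)
print(features.shape)  # 2 card types + 3 entry modes + 2 numeric columns
```

In a real workflow the transformer would be fitted on the training split only and then applied to the test split, so that scaling statistics do not leak from test data.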

In [99]:
# Remove the pound symbol from Amount (literal match, not a regex):
df['Amount'] = df['Amount'].str.replace('£', '', regex=False)

df['Amount'].head()

0      5
1    288
2      5
3     28
4     91
Name: Amount, dtype: object

In [116]:
# Display rows with null values
df[df.isnull().any(axis=1)]

Unnamed: 0,Transaction ID,Date,Day of Week,Time,Type of Card,Entry Mode,Amount,Type of Transaction,Merchant Group,Country of Transaction,Shipping Address,Country of Residence,Gender,Age,Bank,Fraud
286,#2550 151,13-Oct-20,Tuesday,13,Visa,PIN,19.0,ATM,,United Kingdom,United Kingdom,United Kingdom,F,52.9,Barclays,0
383,#2550 715,14-Oct-20,Wednesday,15,Visa,PIN,243.0,ATM,Products,Russia,Russia,Russia,,41.0,Barclays,0
1720,#2550 654,14-Oct-20,Wednesday,10,Visa,PIN,204.0,ATM,,United Kingdom,USA,United Kingdom,M,33.9,RBS,0
2092,#2550 150,14-Oct-20,Wednesday,21,Visa,CVC,11.0,Online,,India,India,India,M,48.9,Lloyds,0
4913,#2550 516,14-Oct-20,Wednesday,20,MasterCard,CVC,27.0,Online,Gaming,USA,,USA,F,51.0,Metro,0
6208,#2550 472,14-Oct-20,Wednesday,17,MasterCard,PIN,5.0,ATM,Food,United Kingdom,,United Kingdom,F,42.6,Halifax,0
6404,#2550 074,14-Oct-20,Wednesday,21,MasterCard,PIN,30.0,ATM,,United Kingdom,United Kingdom,United Kingdom,F,34.5,Barlcays,0
8299,#2550 517,14-Oct-20,Wednesday,17,MasterCard,CVC,13.0,Online,Children,China,,China,M,50.5,Barclays,0
8436,#2550 479,13-Oct-20,Tuesday,9,MasterCard,PIN,20.0,ATM,Children,United Kingdom,,United Kingdom,M,52.1,Barclays,0
34456,#2550 256,13-Oct-20,Tuesday,1,MasterCard,CVC,,Online,Children,USA,USA,United Kingdom,M,24.7,Barclays,1


In [119]:
# Remove rows with null values
df = df.dropna()
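
The null-row listing above includes a bank spelled `Barlcays`, which looks like a misspelling of `Barclays`. If the same misspelling also appears in rows that survive `dropna()`, it would be treated as a separate category. The sketch below shows one way to correct it on a made-up frame. The `corrections` mapping is an assumption and should be confirmed against `df['Bank'].unique()` before it is applied to the real data.

```python
import pandas as pd

# Made-up sample of the Bank column, including the misspelling seen above.
toy = pd.DataFrame({"Bank": ["Barclays", "Barlcays", "RBS", "Barlcays"]})

# Hypothetical mapping from misspelled to canonical category names.
corrections = {"Barlcays": "Barclays"}

toy["Bank"] = toy["Bank"].replace(corrections)
bank_counts = toy["Bank"].value_counts()
print(bank_counts)
```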

In [121]:
# Display duplicate rows
df[df.duplicated()]

Unnamed: 0,Transaction ID,Date,Day of Week,Time,Type of Card,Entry Mode,Amount,Type of Transaction,Merchant Group,Country of Transaction,Shipping Address,Country of Residence,Gender,Age,Bank,Fraud


In [123]:
# Convert Amount to int (copy first so the assignment does not trigger SettingWithCopyWarning)
df = df.copy()
df['Amount'] = df['Amount'].astype(int)


In [124]:
df.describe()

| | Time | Amount | Age | Fraud |
|---|---|---|---|---|
| count | 99977.0 | 99977.0 | 99977.0 | 99977.0 |
| mean | 14.5631 | 112.579933 | 44.993595 | 0.071937 |
| std | 5.308202 | 123.435613 | 9.948121 | 0.258384 |
| min | 0.0 | 5.0 | 15.0 | 0.0 |
| 25% | 10.0 | 17.0 | 38.2 | 0.0 |
| 50% | 15.0 | 30.0 | 44.9 | 0.0 |
| 75% | 19.0 | 208.0 | 51.7 | 0.0 |
| max | 24.0 | 400.0 | 86.1 | 1.0 |


4. **Exploratory Data Analysis (EDA)**
   - Visualize data distributions and relationships using plots (histograms, scatter plots, etc.).
   - Identify patterns, correlations, and potential outliers.
   - Understand the balance of the target variable (fraud vs. non-fraud).
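
A quick way to check the balance of the target, and whether fraud concentrates in particular categories, is to compare fraud rates overall and per group. This is a small sketch on made-up rows. The column names `Entry Mode` and `Fraud` come from the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Made-up transactions; only the column names match the dataset.
toy = pd.DataFrame({
    "Entry Mode": ["Tap", "PIN", "CVC", "CVC", "Tap", "PIN", "CVC", "Tap"],
    "Fraud":      [0,     0,     1,     1,     0,     0,     0,     0],
})

# Overall share of fraudulent rows (mean of a 0/1 column).
fraud_rate = toy["Fraud"].mean()

# Fraud rate within each entry mode, highest first.
rate_by_mode = toy.groupby("Entry Mode")["Fraud"].mean().sort_values(ascending=False)

print(f"Overall fraud rate: {fraud_rate:.1%}")
print(rate_by_mode)
```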

5. **Data Splitting**
   - Split the dataset into training and testing sets (commonly an 80/20 or 70/30 split).
   - Ensure the split maintains the distribution of the target variable.
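
Keeping the target distribution intact across the split is what scikit-learn calls stratification. The sketch below assumes scikit-learn and uses synthetic features with a 7% positive rate, chosen only to resemble the imbalance seen in `df.describe()`. The 80/20 ratio and `random_state` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 7% labelled as fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:70] = 1

# stratify=y keeps the fraud share (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train fraud rate: {y_train.mean():.3f}")
print(f"test fraud rate:  {y_test.mean():.3f}")
```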

6. **Model Selection**
   - Choose a variety of algorithms to test (e.g., Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Neural Networks).
   - Use a validation set or cross-validation to assess performance during training.

7. **Model Training**
   - Train the selected algorithms on the training data.
   - Use cross-validation to fine-tune model parameters and prevent overfitting.
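
Steps 6 and 7 can be sketched together: several candidate models are scored with the same stratified cross-validation folds so that their results are comparable. The data here is synthetic (`make_classification`), the two candidates are only examples, and `average_precision` is one reasonable scoring choice for imbalanced data rather than a requirement of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the preprocessed training set.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.93, 0.07], random_state=0
)

# Example candidates; class_weight="balanced" compensates for the rare class.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
}

# The same folds are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```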

8. **Hyperparameter Tuning**
   - Use techniques like Grid Search or Random Search to find the optimal parameters for each model.
   - Evaluate the models using cross-validation to ensure robustness.
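
A grid search wraps the cross-validation loop and tries every parameter combination. The following minimal sketch assumes scikit-learn, synthetic data, and a deliberately tiny grid over the regularization strength `C` of a logistic regression. The grid values and the `f1` scoring are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=1
)

# Candidate values for the inverse regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid=param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV F1: {search.best_score_:.3f}")
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying all of them.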

9. **Model Evaluation**
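   - Evaluate models on the testing set using Precision, Recall, F1-Score, and ROC-AUC. Treat Accuracy with caution: only about 7% of the transactions are fraudulent, so a model that never flags fraud would still score about 93%.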
   - Use a confusion matrix to understand the types of errors (false positives and false negatives).
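
To make the metrics concrete, the sketch below computes them for ten hand-written test labels and scores. The labels, the scores, and the 0.5 decision threshold are all made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hand-written ground truth and model scores (hypothetical).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.9, 0.8, 0.4, 0.7])

# Turn scores into hard predictions with an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)         # uses scores, not hard predictions

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.3f}")
```

Here one legitimate transaction is flagged (a false positive) and one fraud is missed (a false negative), which gives a precision and recall of 0.75 each.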

10. **Model Selection and Improvement**
    - Choose the best-performing model based on evaluation metrics.
    - Consider further tuning or combining models (ensemble methods) to improve performance.

11. **Final Model Training**
    - Train the final model on the entire dataset for deployment.
    - Save the trained model using a library such as `pickle` or `joblib`.
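
Persisting and restoring a fitted model can be sketched as follows, assuming `joblib` (installed alongside scikit-learn). The model, the synthetic data, and the file name `fraud_model.joblib` are placeholders; a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model fitted on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
final_model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "fraud_model.joblib"  # hypothetical file name
    joblib.dump(final_model, model_path)               # serialize to disk
    restored_model = joblib.load(model_path)           # load it back

# The restored model should reproduce the original predictions exactly.
predictions_match = bool((final_model.predict(X) == restored_model.predict(X)).all())
print("predictions match:", predictions_match)
```

Files written by `pickle` or `joblib` should only be loaded from trusted sources, and ideally with the same library versions that produced them.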

12. **Deployment and Monitoring**
    - Deploy the model to a production environment.
    - Set up monitoring to track the model’s performance over time and identify potential issues.
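
One simple form of monitoring is to recompute a metric per day once true labels arrive and flag days that fall below a target. The sketch below does this for recall on a made-up prediction log. The `Predicted` column, the `recall_threshold` of 0.8, and the log layout are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical prediction log: true label (Fraud) and model output (Predicted).
log = pd.DataFrame({
    "Date": ["13-Oct-20"] * 4 + ["14-Oct-20"] * 4,
    "Fraud":     [1, 1, 0, 0, 1, 1, 0, 1],
    "Predicted": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Mark caught frauds (true positives) and all actual frauds.
log["caught"] = (log["Fraud"] == 1) & (log["Predicted"] == 1)
log["actual"] = log["Fraud"] == 1

# Daily recall = caught frauds / actual frauds.
daily = log.groupby("Date")[["caught", "actual"]].sum()
daily["recall"] = daily["caught"] / daily["actual"]

recall_threshold = 0.8  # assumed alerting threshold
alerts = daily[daily["recall"] < recall_threshold]

print(daily)
print("days needing attention:", list(alerts.index))
```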

13. **Documentation and Reporting**
    - Document the entire process, including data preprocessing steps, model selection, training, and evaluation.
    - Create a comprehensive report with visualizations and explanations of your findings and model performance.

14. **Future Work and Improvements**
    - Identify areas for future improvements, such as collecting more data, adding new features, or trying different algorithms.
    - Discuss potential enhancements to the model and the impact they could have.