# **Fastag Fraud Detection**

The goal of this project is to create a reliable fraud detection system for Fastag transactions by utilizing machine learning classification approaches. Crucial features in the dataset include transaction amounts, vehicle information, geographic location, and transaction details. Ensuring the integrity and security of Fastag transactions requires building a strong model that can reliably detect instances of fraudulent activity.

**Importing the Dependencies**

In [74]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_score, f1_score, recall_score

# **Data Exploration**

Dataset Description:
1. Transaction_ID: Unique identifier for each transaction.
2. Timestamp: Date and time of the transaction.
3. Vehicle_Type: Type of vehicle involved in the transaction.
4. FastagID: Unique identifier for Fastag.
5. TollBoothID: Identifier for the toll booth.
6. Lane_Type: Type of lane used for the transaction.
7. Vehicle_Dimensions: Dimensions of the vehicle.
8. Transaction_Amount: Amount associated with the transaction.
9. Amount_paid: Amount paid for the transaction.
10. Geographical_Location: Location details of the transaction.
11. Vehicle_Speed: Speed of the vehicle during the transaction.
12. Vehicle_Plate_Number: License plate number of the vehicle.
13. Fraud_indicator: Label for fraudulent activity ('Fraud' or 'Not Fraud').
14. Output: Numeric form of Fraud_indicator (1 = Fraud, 0 = Not Fraud), used as the target variable.

In [20]:
# Load the CSV file into a pandas DataFrame
data = pd.read_csv('/content/FastagFraudDetection.csv')

In [21]:
# read_csv already returns a DataFrame, so it can be displayed directly
data

Unnamed: 0,Transaction_ID,Timestamp,Vehicle_Type,FastagID,TollBoothID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Geographical_Location,Vehicle_Speed,Vehicle_Plate_Number,Fraud_indicator,Output
0,1,01-06-23 11:20,Bus,FTG-001-ABC-121,A-101,Express,Large,350,120,"13.059816123454882, 77.77068662374292",65,KA11AB1234,Fraud,1
1,2,01-07-23 14:55,Car,FTG-002-XYZ-451,B-102,Regular,Small,120,100,"13.059816123454882, 77.77068662374292",78,KA66CD5678,Fraud,1
2,3,01-08-23 18:25,Motorcycle,,D-104,Regular,Small,0,0,"13.059816123454882, 77.77068662374292",53,KA88EF9012,Not Fraud,0
3,4,01-09-23 2:05,Truck,FTG-044-LMN-322,C-103,Regular,Large,350,120,"13.059816123454882, 77.77068662374292",92,KA11GH3456,Fraud,1
4,5,01-10-23 6:35,Van,FTG-505-DEF-652,B-102,Express,Medium,140,100,"13.059816123454882, 77.77068662374292",60,KA44IJ6789,Fraud,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,01-01-23 22:18,Truck,FTG-445-EDC-765,C-103,Regular,Large,330,330,"13.21331620748757, 77.55413526894684",81,KA74ST0123,Not Fraud,0
4996,4997,1/17/2023 13:43,Van,FTG-446-LMK-432,B-102,Express,Medium,125,125,"13.21331620748757, 77.55413526894684",64,KA38UV3456,Not Fraud,0
4997,4998,02-05-23 5:08,Sedan,FTG-447-PLN-109,A-101,Regular,Medium,115,115,"13.21331620748757, 77.55413526894684",93,KA33WX6789,Not Fraud,0
4998,4999,2/20/2023 20:34,SUV,FTG-458-VFR-876,B-102,Express,Large,145,145,"13.21331620748757, 77.55413526894684",57,KA35YZ0123,Not Fraud,0


In [22]:
data.shape

(5000, 14)

# **Feature Engineering**

Identify and engineer relevant features that contribute to fraud detection accuracy.

**One-Hot Encoding**

In [25]:
data['Vehicle_Type'].unique()

array(['Bus ', 'Car', 'Motorcycle', 'Truck', 'Van', 'Sedan', 'SUV'],
      dtype=object)
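The output above shows that one label, 'Bus ', carries a trailing space, and that space is carried into the one-hot column name later on. A minimal sketch of cleaning such labels before encoding, assuming a small stand-in frame (`sample` is hypothetical and not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the real data; only the 'Bus ' label mirrors the output above.
sample = pd.DataFrame({"Vehicle_Type": ["Bus ", "Car", "Bus ", "SUV"]})

# str.strip() removes leading and trailing whitespace from every label.
sample["Vehicle_Type"] = sample["Vehicle_Type"].str.strip()

cleaned_labels = sample["Vehicle_Type"].unique().tolist()
print(cleaned_labels)
```

The notebook keeps the original labels, so its encoded column is still named 'Bus ' with the space.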

In [26]:
Lane_order=['Express', 'Regular']
Vehicle_Dimensions_order=['Large', 'Small', 'Medium']
Fraud_indicator_order=['Not Fraud','Fraud']

In [28]:
ohe = OneHotEncoder()
encode0 = ohe.fit_transform(data[['Vehicle_Type']]).toarray()

In [33]:
feature_labels = ohe.categories_
np.array(feature_labels).ravel()

array(['Bus ', 'Car', 'Motorcycle', 'SUV', 'Sedan', 'Truck', 'Van'],
      dtype=object)

In [34]:
feature_labels = np.array(feature_labels).ravel()
print(feature_labels)

['Bus ' 'Car' 'Motorcycle' 'SUV' 'Sedan' 'Truck' 'Van']


In [40]:
features = pd.DataFrame(encode0, columns = feature_labels)


In [45]:
df_new = pd.concat([data, features], axis=1)
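The cells above fit a OneHotEncoder, build the column labels by hand, and concatenate the result with the original frame. For a quick exploration the same kind of indicator columns can also be produced in one step with `pd.get_dummies`. This is only a sketch on a hypothetical toy frame, not the notebook's method:

```python
import pandas as pd

# Toy frame; column names follow the dataset description, values are made up.
sample = pd.DataFrame({
    "Vehicle_Type": ["Bus", "Car", "Motorcycle", "Car"],
    "Transaction_Amount": [350, 120, 0, 100],
})

# get_dummies creates one indicator column per category and names it after the label.
dummies = pd.get_dummies(sample["Vehicle_Type"], dtype=float)

# Concatenating on the shared index keeps each indicator row next to its source row.
encoded = pd.concat([sample.drop(columns="Vehicle_Type"), dummies], axis=1)
print(encoded)
```

One trade-off: unlike a fitted OneHotEncoder, `get_dummies` does not remember the categories it saw, so new data with different vehicle types would produce different columns.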

In [46]:
new_dataset=df_new.drop(['Timestamp','FastagID','Vehicle_Type','TollBoothID','Geographical_Location','Vehicle_Plate_Number'], axis=1)

In [47]:
new_dataset

Unnamed: 0,Transaction_ID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Vehicle_Speed,Fraud_indicator,Output,Bus,Car,Motorcycle,SUV,Sedan,Truck,Van
0,1,Express,Large,350,120,65,Fraud,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Regular,Small,120,100,78,Fraud,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,3,Regular,Small,0,0,53,Not Fraud,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,4,Regular,Large,350,120,92,Fraud,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,Express,Medium,140,100,60,Fraud,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,Regular,Large,330,330,81,Not Fraud,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4996,4997,Express,Medium,125,125,64,Not Fraud,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4997,4998,Regular,Medium,115,115,93,Not Fraud,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4998,4999,Express,Large,145,145,57,Not Fraud,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


**Ordinal Encoding**

In [48]:
encode1 = OrdinalEncoder(categories=[Lane_order])
encode2 = OrdinalEncoder(categories=[Vehicle_Dimensions_order])
encode3 = OrdinalEncoder(categories=[Fraud_indicator_order])

In [49]:
encode1.fit(new_dataset[['Lane_Type']])
encode2.fit(new_dataset[['Vehicle_Dimensions']])
encode3.fit(new_dataset[['Fraud_indicator']])

In [50]:
new_lane = pd.DataFrame(encode1.transform(new_dataset[['Lane_Type']]))
new_dimensions = pd.DataFrame(encode2.transform(new_dataset[['Vehicle_Dimensions']]))
new_fraud_indicator = pd.DataFrame(encode3.transform(new_dataset[['Fraud_indicator']]))

In [51]:
new_dataset['Lane_Type'] = new_lane
new_dataset['Vehicle_Dimensions'] = new_dimensions
new_dataset['Fraud_indicator'] = new_fraud_indicator
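With an explicit `categories` list, OrdinalEncoder assigns each label its position in that list. The notebook's order `['Large', 'Small', 'Medium']` therefore maps Large to 0, Small to 1 and Medium to 2, which does not follow physical size. The sketch below shows a size-based ordering as a hypothetical alternative (`size_order` and `sample` are assumptions, not names from the notebook):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sample = pd.DataFrame({"Vehicle_Dimensions": ["Large", "Small", "Medium", "Small"]})

# Hypothetical alternative ordering by physical size: Small < Medium < Large.
size_order = ["Small", "Medium", "Large"]
encoder = OrdinalEncoder(categories=[size_order])

# fit_transform returns a 2-D float array; ravel() flattens it to one code per row.
codes = encoder.fit_transform(sample[["Vehicle_Dimensions"]]).ravel()
sample["Vehicle_Dimensions_code"] = codes
print(sample)
```

Tree-based models are largely indifferent to which integer each label gets, while linear models treat the codes as magnitudes, so the ordering may matter more for Logistic Regression and SVC.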

In [52]:
# dataset information
new_dataset

Unnamed: 0,Transaction_ID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Vehicle_Speed,Fraud_indicator,Output,Bus,Car,Motorcycle,SUV,Sedan,Truck,Van
0,1,0.0,0.0,350,120,65,1.0,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,1.0,1.0,120,100,78,1.0,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,3,1.0,1.0,0,0,53,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,4,1.0,0.0,350,120,92,1.0,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5,0.0,2.0,140,100,60,1.0,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,1.0,0.0,330,330,81,0.0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4996,4997,0.0,2.0,125,125,64,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4997,4998,1.0,2.0,115,115,93,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4998,4999,0.0,0.0,145,145,57,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


**Handling missing values**

In [53]:
# handle missing values by dropping any row that contains one
data = new_dataset.dropna(how='any')

In [54]:
# checking the missing values of each column
data.isnull().sum()

Transaction_ID        0
Lane_Type             0
Vehicle_Dimensions    0
Transaction_Amount    0
Amount_paid           0
Vehicle_Speed         0
Fraud_indicator       0
Output                0
Bus                   0
Car                   0
Motorcycle            0
SUV                   0
Sedan                 0
Truck                 0
Van                   0
dtype: int64
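In the preview of the raw data, the only visible missing value is the FastagID of transaction 3, and that column was dropped before `dropna` ran. This would explain why all 5000 rows survive the step. A small sketch of how the order of operations affects the row count, using a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing FastagID; values are illustrative only.
sample = pd.DataFrame({
    "FastagID": ["FTG-001", np.nan, "FTG-044"],
    "Transaction_Amount": [350, 0, 350],
})

# Count missing values per column before deciding how to handle them.
missing_before = sample.isnull().sum()

# how='any' drops a row as soon as one of its columns is missing.
dropped = sample.dropna(how="any")

# Dropping the incomplete column first keeps every row.
kept = sample.drop(columns="FastagID").dropna(how="any")
print(missing_before["FastagID"], len(dropped), len(kept))
```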

In [55]:
# distribution of legit and fraud transactions
data['Output'].value_counts()

0    4017
1     983
Name: Output, dtype: int64

**Imbalanced dataset: 4017 legitimate vs. 983 fraudulent transactions**

**0 -> Legitimate transaction**

**1 -> Fraudulent transaction**

In [56]:
# separating data for analysis
legit = data[data.Output == 0]
fraud = data[data.Output == 1]

In [57]:
print(legit.shape)
print(fraud.shape)

(4017, 15)
(983, 15)


In [58]:
# summary statistics of the transaction amount for legitimate transactions
legit.Transaction_Amount.describe()

count    4017.000000
mean      153.110530
std       114.435986
min         0.000000
25%        90.000000
50%       125.000000
75%       290.000000
max       350.000000
Name: Transaction_Amount, dtype: float64

In [59]:
fraud.Transaction_Amount.describe()

count    983.000000
mean     193.555443
std       97.465586
min       60.000000
25%      120.000000
50%      145.000000
75%      300.000000
max      350.000000
Name: Transaction_Amount, dtype: float64

In [60]:
# compare the feature means of legitimate and fraudulent transactions
data.groupby('Output').mean()

Output,Transaction_ID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Vehicle_Speed,Fraud_indicator,Bus,Car,Motorcycle,SUV,Sedan,Truck,Van
0,2618.24695,0.588748,0.86582,153.11053,153.11053,67.731392,0.0,0.13418,0.147374,0.177745,0.131939,0.137665,0.138412,0.132686
1,2019.330621,0.501526,0.819939,193.555443,92.83825,68.340793,1.0,0.180061,0.12411,0.0,0.187182,0.163784,0.160732,0.18413


**Under Sampling**


In [61]:
legit_sample = legit.sample(n=983)
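The call above draws a different random subset on every run because no seed is set. A minimal sketch of a reproducible variant, assuming a hypothetical helper `undersample` that shrinks every class to the size of the smallest one:

```python
import pandas as pd

def undersample(frame, target, random_state=0):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    smallest = frame[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts, axis=0)

# Toy frame with an 8:2 class imbalance.
sample = pd.DataFrame({
    "Transaction_Amount": range(10),
    "Output": [0] * 8 + [1] * 2,
})

balanced = undersample(sample, "Output", random_state=42)
print(balanced["Output"].value_counts())
```

Fixing `random_state` means the same legitimate rows are selected each time, so later metrics can be compared across runs.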

**Concatenating two DataFrames**

In [62]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [63]:
new_dataset.head()

Unnamed: 0,Transaction_ID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Vehicle_Speed,Fraud_indicator,Output,Bus,Car,Motorcycle,SUV,Sedan,Truck,Van
2572,2573,1.0,1.0,0,0,53,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4234,4235,0.0,2.0,110,110,57,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3173,3174,1.0,1.0,60,60,43,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3739,3740,1.0,0.0,340,340,97,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4702,4703,0.0,2.0,125,125,61,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [64]:
# distribution of legit and fraud transactions in the new dataset
new_dataset['Output'].value_counts()

0    983
1    983
Name: Output, dtype: int64

In [65]:
new_dataset.groupby('Output').mean()

Output,Transaction_ID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Vehicle_Speed,Fraud_indicator,Bus,Car,Motorcycle,SUV,Sedan,Truck,Van
0,2655.776195,0.618515,0.865717,155.84944,155.84944,68.314344,0.0,0.137335,0.160732,0.165819,0.122075,0.142421,0.144456,0.127162
1,2019.330621,0.501526,0.819939,193.555443,92.83825,68.340793,1.0,0.180061,0.12411,0.0,0.187182,0.163784,0.160732,0.18413


**Splitting the data into features and targets**

In [66]:
X = new_dataset.drop(columns='Output')
Y = new_dataset['Output']
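Note that X, as built above, still contains Fraud_indicator, which carries the same label as Output, as well as Transaction_ID, which only identifies the row. A model can read the answer directly from Fraud_indicator, so the scores reported later are likely inflated by this leakage. A sketch of a leakage-free split on a hypothetical toy frame (`non_features`, `X_clean` and `Y_clean` are illustrative names):

```python
import pandas as pd

# Toy frame with a subset of the notebook's columns; values are illustrative.
sample = pd.DataFrame({
    "Transaction_ID": [1, 2, 3],
    "Transaction_Amount": [350, 120, 0],
    "Amount_paid": [120, 100, 0],
    "Fraud_indicator": [1.0, 1.0, 0.0],
    "Output": [1, 1, 0],
})

# Columns that either restate the target or merely identify the row.
non_features = ["Output", "Fraud_indicator", "Transaction_ID"]

X_clean = sample.drop(columns=non_features)
Y_clean = sample["Output"]
print(X_clean.columns.tolist())
```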

In [67]:
print(X)

      Transaction_ID  Lane_Type  Vehicle_Dimensions  Transaction_Amount  \
2572            2573        1.0                 1.0                   0   
4234            4235        0.0                 2.0                 110   
3173            3174        1.0                 1.0                  60   
3739            3740        1.0                 0.0                 340   
4702            4703        0.0                 2.0                 125   
...              ...        ...                 ...                 ...   
4957            4958        1.0                 0.0                 330   
4962            4963        0.0                 2.0                 115   
4970            4971        0.0                 0.0                 145   
4975            4976        1.0                 2.0                 125   
4999            5000        1.0                 0.0                 330   

      Amount_paid  Vehicle_Speed  Fraud_indicator  Bus   Car  Motorcycle  SUV  \
2572            0 

In [68]:
print(Y)

2572    0
4234    0
3173    0
3739    0
4702    0
       ..
4957    1
4962    1
4970    1
4975    1
4999    1
Name: Output, Length: 1966, dtype: int64


**Split data into training and testing**

In [69]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, train_size=0.8, stratify=Y, random_state=2)
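Passing `stratify=Y` asks `train_test_split` to keep the class ratio the same in both splits. The sketch below checks this on synthetic labels with the same balance as the undersampled data, 983 rows per class (`rows` and `labels` are stand-ins, not the notebook's X and Y):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 983 rows per class, one dummy feature.
labels = np.array([0] * 983 + [1] * 983)
rows = np.arange(len(labels)).reshape(-1, 1)

rows_tr, rows_te, labels_tr, labels_te = train_test_split(
    rows, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratify=labels both splits keep the 50/50 class ratio.
print(rows_tr.shape, rows_te.shape, labels_tr.mean(), labels_te.mean())
```

Specifying `test_size=0.2` alone is enough here; `train_size` then defaults to the complement.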

In [70]:

print(X.shape, X_train.shape, X_test.shape)

(1966, 14) (1572, 14) (394, 14)


In [71]:
pd.DataFrame(X_train)

Unnamed: 0,Transaction_ID,Lane_Type,Vehicle_Dimensions,Transaction_Amount,Amount_paid,Vehicle_Speed,Fraud_indicator,Bus,Car,Motorcycle,SUV,Sedan,Truck,Van
4322,4323,1.0,1.0,0,0,52,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3758,3759,0.0,2.0,120,100,57,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2181,2182,1.0,2.0,125,90,95,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3938,3939,0.0,0.0,340,340,65,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1737,1738,1.0,1.0,70,70,41,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2557,2558,1.0,1.0,120,120,75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3718,3719,0.0,0.0,340,340,70,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4995,4996,1.0,0.0,330,330,81,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
318,319,0.0,0.0,350,350,52,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [72]:
pd.DataFrame(Y_train)

Unnamed: 0,Output
4322,0
3758,1
2181,1
3938,0
1737,0
...,...
2557,0
3718,0
4995,0
318,0


# **Model Development**

In [75]:
Models = {
    "Decision Tree":DecisionTreeClassifier(),
    "Random Forest":RandomForestClassifier(),
    "Logistic Regression":LogisticRegression(),
    "SVM Classification": SVC()
}

for name, Model in Models.items():

    # Train the model
    Model.fit(X_train, Y_train)

    # Make predictions
    Y_train_pred = Model.predict(X_train)
    Y_test_pred = Model.predict(X_test)

    # Training performance
    model_train_Accuracy = accuracy_score(Y_train, Y_train_pred)
    model_train_Precision = precision_score(Y_train, Y_train_pred)
    model_train_recall = recall_score(Y_train, Y_train_pred)
    model_train_F1 = f1_score(Y_train, Y_train_pred, average='weighted')

    # Testing performance
    model_test_Accuracy = accuracy_score(Y_test, Y_test_pred)
    model_test_Precision = precision_score(Y_test, Y_test_pred)
    model_test_recall = recall_score(Y_test, Y_test_pred)
    model_test_F1 = f1_score(Y_test, Y_test_pred, average='weighted')

    print(name)

    print("Models Performance for Training Set")
    print("- Accuracy: {:.4f}".format(model_train_Accuracy))
    print("- Precision: {:.4f}".format(model_train_Precision))
    print("- Recall: {:.4f}".format(model_train_recall))
    print("- F1 Score: {:.4f}".format(model_train_F1))

    print("--------------------------")

    print("Models Performance for Testing Set")
    print("- Accuracy: {:.4f}".format(model_test_Accuracy))
    print("- Precision: {:.4f}".format(model_test_Precision))
    print("- Recall: {:.4f}".format(model_test_recall))
    print("- F1 Score: {:.4f}".format(model_test_F1))

    print('='*35)
    print('\n')

Decision Tree
Models Performance for Training Set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
--------------------------
Models Performance for Testing Set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Random Forest
Models Performance for Training Set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
--------------------------
Models Performance for Testing Set
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000


Logistic Regression
Models Performance for Training Set
- Accuracy: 0.9987
- Precision: 1.0000
- Recall: 0.9975
- F1 Score: 0.9987
--------------------------
Models Performance for Testing Set
- Accuracy: 0.9949
- Precision: 1.0000
- Recall: 0.9898
- F1 Score: 0.9949




STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


SVM Classification
Models Performance for Training Set
- Accuracy: 0.6940
- Precision: 0.7402
- Recall: 0.5980
- F1 Score: 0.6912
--------------------------
Models Performance for Testing Set
- Accuracy: 0.6954
- Precision: 0.7251
- Recall: 0.6294
- F1 Score: 0.6941




# **Outcome**

*   Decision Tree and Random Forest reach 100% accuracy on both the training and test sets, compared with 99.49% test accuracy for Logistic Regression and 69.54% for SVC.
*   On the test set, Decision Tree and Random Forest also score 100% precision and recall for the fraud class. Logistic Regression follows with 100% precision and 98.98% recall, while SVC trails at 72.51% precision and 62.94% recall.
*   Overall, Decision Tree and Random Forest performed best at detecting fraud cases on this dataset. These scores should be read with caution, because the feature matrix X still contains Fraud_indicator, which carries the same label as the target Output.
*   The weaker models might improve with a larger sample or with deep learning algorithms, at a higher computational cost. More complex anomaly detection models might also help detect more fraudulent cases.
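The evaluation loop reports precision and recall for the fraud class (the scikit-learn default for binary labels) but computes F1 with `average='weighted'`, so the printed F1 is not the harmonic mean of the printed precision and recall. The sketch below derives the metrics from a confusion matrix to show the difference. The labels and predictions are hypothetical and not taken from the notebook's models.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# ravel() on a binary confusion matrix yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted frauds that are real frauds
recall = tp / (tp + fn)     # share of real frauds that were detected
f1_binary = 2 * precision * recall / (precision + recall)

# Weighted F1 averages the per-class F1 scores by class support.
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(precision, recall, f1_binary, f1_weighted)
```

For fraud detection, the binary F1 of the fraud class is usually the more direct summary, since the weighted average also rewards correct predictions on the legitimate class.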



