
In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [2]:
# Load the transaction dataset (adjust the path to wherever the CSV file is stored)
data = pd.read_csv('PS_20174392719_1491204439457_log.csv')
print(data)

       step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0         1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1         1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2         1  TRANSFER    181.00  C1305486145          181.0            0.00   
3         1  CASH_OUT    181.00   C840083671          181.0            0.00   
4         1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   
...     ...       ...       ...          ...            ...             ...   
97220    10   PAYMENT  17011.46  C1283088834            0.0            0.00   
97221    10   PAYMENT   1680.50      C671281            0.0            0.00   
97222    10   PAYMENT  26450.83   C948744009            0.0            0.00   
97223    10   PAYMENT  12171.47  C1843902873            0.0            0.00   
97224    10   PAYMENT   6733.59   C708911726            0.0            0.00   

          nameDest  oldbalanceDest  newbalanceDest 

In [3]:
# Drop rows with missing values (NaN)
data = data.dropna()

# Check the column names in your dataset
print(data.columns)

print(data)

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')
       step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0         1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1         1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2         1  TRANSFER    181.00  C1305486145          181.0            0.00   
3         1  CASH_OUT    181.00   C840083671          181.0            0.00   
4         1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   
...     ...       ...       ...          ...            ...             ...   
97219    10   PAYMENT  10811.91   C504389296         1009.0            0.00   
97220    10   PAYMENT  17011.46  C1283088834            0.0            0.00   
97221    10   PAYMENT   1680.50      C671281            0.0            0.00   
97222    10   PA

In [4]:
# Select relevant features and target variable
X = data[['step', 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
y = data['isFraud']

In [5]:
# Encode categorical feature 'type' using one-hot encoding
X = pd.get_dummies(data=X, columns=['type'], drop_first=True)


In [6]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
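# Implement and train the Decision Tree Classifier with default settings (no depth limit).
# No random_state is set, so the fitted tree and its metrics can vary slightly between runs.
dt_classifier = DecisionTreeClassifier()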
dt_classifier.fit(X_train, y_train)

In [8]:
# Predict with Decision Tree Classifier
dt_predictions = dt_classifier.predict(X_test)

In [9]:
# Evaluate Decision Tree Classifier performance
dt_accuracy = accuracy_score(y_test, dt_predictions)
dt_precision = precision_score(y_test, dt_predictions)
dt_recall = recall_score(y_test, dt_predictions)
dt_f1 = f1_score(y_test, dt_predictions)

In [10]:
# Implement and train the Naïve Bayes Classifier (Gaussian NB for continuous data)
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

In [11]:
# Predict with Naïve Bayes Classifier
nb_predictions = nb_classifier.predict(X_test)

In [12]:
# Evaluate Naïve Bayes Classifier performance
nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_precision = precision_score(y_test, nb_predictions)
nb_recall = recall_score(y_test, nb_predictions)
nb_f1 = f1_score(y_test, nb_predictions)

In [13]:
# Report the Decision Tree Classifier's performance on the test set
print("Decision Tree Classifier:")
print(f"Accuracy: {dt_accuracy:.2f}")
print(f"Precision: {dt_precision:.2f}")
print(f"Recall: {dt_recall:.2f}")
print(f"F1 Score: {dt_f1:.2f}")


Decision Tree Classifier:
Accuracy: 1.00
Precision: 0.52
Recall: 0.57
F1 Score: 0.54


In [14]:
# Report the Naïve Bayes Classifier's performance on the test set
print("\nNaïve Bayes Classifier:")
print(f"Accuracy: {nb_accuracy:.2f}")
print(f"Precision: {nb_precision:.2f}")
print(f"Recall: {nb_recall:.2f}")
print(f"F1 Score: {nb_f1:.2f}")


Naïve Bayes Classifier:
Accuracy: 0.98
Precision: 0.01
Recall: 0.09
F1 Score: 0.01


**Conclusion**

### Decision Tree Classifier:

1. **Accuracy (1.00)**:
   - This accuracy is measured on the held-out test set and rounds to 1.00. Because the dataset is imbalanced, a model can reach near-perfect accuracy by classifying almost every transaction as legitimate, so accuracy alone says little about how well fraud is detected.

2. **Precision (0.52)**:
   - Precision measures the proportion of true positive predictions among all positive predictions made by the model. A precision of 0.52 means that roughly half of the transactions the Decision Tree flags as fraudulent are actually legitimate (false positives).

3. **Recall (0.57)**:
   - Recall, also known as sensitivity or true positive rate, measures the proportion of actual positives that the model correctly identifies. A recall of 0.57 indicates that the model captures about 57% of the actual fraudulent transactions and misses the remaining 43%.

4. **F1 Score (0.54)**:
   - The F1 score is the harmonic mean of precision and recall. Precision and recall are close to each other here, so the F1 score of 0.54 reflects that both are only moderate rather than a lopsided trade-off between them.
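
To make the definitions above concrete, the following minimal sketch computes precision, recall, and F1 by hand from true positive, false positive, and false negative counts, and checks the result against scikit-learn. The labels in `y_true` and `y_pred` are hypothetical toy values chosen for illustration; they are not taken from the transaction dataset.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels (1 = fraud, 0 = legitimate); not from the dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

# Count the confusion-matrix cells that the three metrics depend on.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # fraud correctly flagged
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # legitimate flagged as fraud
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # fraud that was missed

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")

# The hand-computed values should agree with scikit-learn's implementations.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

Note that true negatives do not appear in any of the three formulas, which is why these metrics are less affected by the large number of legitimate transactions than accuracy is.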

### Naïve Bayes Classifier:

1. **Accuracy (0.98)**:
   - The high accuracy mostly reflects the class imbalance rather than the model's ability to detect fraud. In cases like fraud detection with imbalanced datasets, high accuracy can be misleading, because correctly classifying the many legitimate transactions dominates the score.

2. **Precision (0.01)**:
   - The extremely low precision means that almost all of the transactions the Naïve Bayes model flags as fraudulent are legitimate: only about 1 in 100 flagged transactions is actual fraud.

3. **Recall (0.09)**:
   - Recall is also very low, indicating that the Naïve Bayes model is not effectively capturing actual fraudulent transactions. It misses around 91% of the fraud cases.

4. **F1 Score (0.01)**:
   - The F1 score is very low because both precision and recall are poor, and the harmonic mean is pulled toward the smaller of the two (here, the near-zero precision). The model performs poorly in terms of both false positives and false negatives.
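
One way to see why accuracy can mislead on imbalanced data is to compare against a baseline that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical synthetic data (1,000 transactions, 10 of them fraud); the data, the 1% fraud rate, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 10 fraud cases among 1,000 transactions.
y = np.zeros(1000, dtype=int)
y[:10] = 1

# Random placeholder features; the baseline ignores them entirely.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

# A "model" that always predicts the most frequent class (legitimate).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_predictions = baseline.predict(X)

baseline_accuracy = accuracy_score(y, baseline_predictions)
baseline_recall = recall_score(y, baseline_predictions)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # high despite detecting nothing
print(f"Baseline recall:   {baseline_recall:.2f}")    # no fraud case is caught
```

A classifier is only useful for fraud detection if it clearly beats this kind of baseline on recall and precision, not just on accuracy.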

### Explanation:

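- Decision Trees are prone to overfitting when they are not pruned or when the tree depth is not limited, as with the default settings used here. Such a tree can fit the training data almost perfectly yet generalize poorly to test data. Training metrics were not computed in this notebook, so overfitting is a plausible but unconfirmed explanation for the moderate test precision and recall.
- Naïve Bayes, on the other hand, assumes that features are independent given the class, and Gaussian Naïve Bayes additionally assumes each feature is normally distributed within a class. These assumptions may not capture complex relationships in the data, leading to suboptimal performance for some tasks, especially when dealing with imbalanced data.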
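
The overfitting point can be checked by comparing training and test scores for an unconstrained tree and a depth-limited one. The following is a minimal sketch on synthetic imbalanced data generated with `make_classification`; the dataset, the 3% positive rate, and the `max_depth=5` and `min_samples_leaf=10` settings are illustrative assumptions, not values tuned for the transaction data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an imbalanced dataset (about 3% positives).
X, y = make_classification(
    n_samples=5000, n_features=8, weights=[0.97, 0.03], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Default settings: the tree grows until every leaf is pure.
    "unconstrained": DecisionTreeClassifier(random_state=42),
    # Pre-pruned: limited depth and a minimum number of samples per leaf.
    "depth-limited": DecisionTreeClassifier(
        max_depth=5, min_samples_leaf=10, random_state=42
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "depth": model.get_depth(),
        "train_f1": f1_score(y_train, model.predict(X_train)),
        "test_f1": f1_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

A large gap between `train_f1` and `test_f1` for the unconstrained tree would indicate overfitting. Whether the depth-limited tree actually scores better on the test set depends on the data, so the hyperparameters are best chosen by cross-validation.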

In fraud detection, high recall (capturing most fraudulent cases) and decent precision (minimizing false positives) are often more critical than overall accuracy. Neither model meets that bar here, so further tuning (for example, limiting tree depth or accounting for the class imbalance during training) and possibly different algorithms or techniques may be needed before either is suitable for real-world fraud detection.
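
One simple form of tuning is to adjust the probability threshold at which a transaction is flagged, which trades precision against recall. The sketch below illustrates this with `GaussianNB.predict_proba` on synthetic imbalanced data; the dataset and the thresholds 0.1, 0.5, and 0.9 are assumptions for illustration, and this is not presented as the approach that would work best on the transaction data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for an imbalanced dataset (about 3% positives).
X, y = make_classification(
    n_samples=5000, n_features=8, weights=[0.97, 0.03], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = GaussianNB().fit(X_train, y_train)

# Estimated probability of the positive (fraud) class for each test row.
fraud_probability = model.predict_proba(X_test)[:, 1]

recalls = {}
for threshold in (0.1, 0.5, 0.9):
    # Flag a transaction when its estimated fraud probability reaches the threshold.
    predictions = (fraud_probability >= threshold).astype(int)
    precision = precision_score(y_test, predictions, zero_division=0)
    recalls[threshold] = recall_score(y_test, predictions)
    print(f"threshold={threshold}: precision={precision:.2f}, "
          f"recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only keep or increase recall, since every transaction flagged at a higher threshold is still flagged at a lower one. Precision usually falls as a result, so the threshold should reflect the relative cost of missed fraud versus false alarms.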