# **1. Introduction**

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.



## **Content**
The data file records, for each transaction, its time, a set of anonymised features, its amount, and a label stating whether the transaction is legitimate or fraudulent.

## **Acknowledgements**
This public dataset is available on [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).

## **Process for this model building**
1. Importing necessary libraries and dataset
2. Getting familiar with the dataset
3. Cleaning the dataset
4. Processing the dataset
5. Splitting the data into Train and Test data
6. Training the model using Train data
7. Checking the accuracy score of the model

# **2. Prepare**

**Information on the Dataset**

* The dataset contains transactions made by credit cards in September 2013 by European cardholders.
* The dataset covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
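
The imbalance figure can be checked with a little arithmetic. The sketch below uses only the two counts quoted above; the variable names are illustrative and not part of the original notebook.

```python
# Counts quoted in the dataset description above
n_fraud = 492
n_total = 284_807

# Fraction of all transactions that are fraudulent
fraud_share = n_fraud / n_total

# How many legit transactions there are for every fraud
legit_per_fraud = (n_total - n_fraud) / n_fraud

print(f"Fraud share: {fraud_share:.4%}")
print(f"Legit transactions per fraud: {legit_per_fraud:.1f}")
```

In other words, there are several hundred legitimate transactions for every fraudulent one, which is why the class balance needs attention before training.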

**Limitations of Dataset:**

* The data covers a period of only about two days.

**Is the Data ROCCC?**

A good data source is ROCCC, which stands for **R**eliable, **O**riginal, **C**omprehensive, **C**urrent, and **C**ited.

* Reliable - LOW - Covers only two days of transactions
* Original - MED - Collected and analysed during a research collaboration of [Worldline and the Machine Learning Group](https://mlg.ulb.ac.be/)
* Comprehensive - LOW - The meaning of most features is hidden to protect cardholder information
* Current - LOW - The data dates from September 2013 and may no longer be relevant
* Cited - MED - Data collected by Worldline and the Machine Learning Group

Overall, the dataset is of acceptable quality for a modelling exercise, but it covers only two days. Hence, it is not advisable to base business recommendations on this data.

**Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

**Loading the dataset to a Pandas DataFrame**

In [None]:
credit_card_data = pd.read_csv('../input/creditcardfraud/creditcard.csv')

**First 5 Rows of the dataset**

In [None]:
credit_card_data.head()

**Columns Information:**

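1. Time: Seconds elapsed between this transaction and the first transaction in the dataset
2. Amount: Amount of the transaction (the dataset description does not state the currency)
3. V1-V28: Anonymised features of the transaction (principal components obtained with PCA, according to the dataset description)
4. Class: Whether the transaction is legit or not (0: Legit, 1: Fraud)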

**Last 5 rows of the dataset**

In [None]:
credit_card_data.tail()

* **Finding:**

The last value of `Time` is 172792 seconds, which is just under 48 hours. This confirms that the dataset spans about two days.
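
As a quick check of that conversion, here is a minimal sketch using only the standard library. The value 172792 comes from the finding above; the variable names are illustrative.

```python
from datetime import timedelta

# Last value of the Time column, in seconds since the first transaction
last_time_seconds = 172_792

span = timedelta(seconds=last_time_seconds)
two_days = timedelta(days=2)

print("Span covered by the dataset:", span)
print("Shortfall from two full days:", two_days - span)
```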

**Datatype of each column**

In [None]:
credit_card_data.info()

The datatypes look good. There is no need to change anything here.

# **3. Process**
**Key Objective:**

* Observe the data and become familiar with it
* Check for null or missing values
* Perform sanity check of data
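
The checks listed above could look roughly like the following. This is a sketch on a tiny hypothetical DataFrame (`checks_df`), not on the real file, and the specific checks chosen (duplicates, label values, non-negative amounts) are assumptions about what a sanity check might include.

```python
import pandas as pd

# Hypothetical toy data with one missing Amount and one duplicated row
checks_df = pd.DataFrame({
    "Time":   [0, 1, 2, 2],
    "Amount": [10.0, None, 2.5, 2.5],
    "Class":  [0, 0, 1, 1],
})

# Missing values per column
null_counts = checks_df.isnull().sum()

# Fully duplicated rows (every column identical to an earlier row)
n_duplicates = int(checks_df.duplicated().sum())

# The label should only ever be 0 or 1
labels_ok = bool(checks_df["Class"].isin([0, 1]).all())

# Amounts, where present, should not be negative
amounts_ok = bool((checks_df["Amount"].dropna() >= 0).all())

print(null_counts)
print("Duplicated rows:", n_duplicates)
print("Labels valid:", labels_ok, "| Amounts valid:", amounts_ok)
```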

**Checking the no. of Null Values**

In [None]:
credit_card_data.isnull().sum()

This also looks good: there are no null values, so we can move on.

**Distribution of legit transaction & fraud transaction**

In [None]:
credit_card_data['Class'].value_counts()

* **Finding:**

This shows that the dataset is very unbalanced: fraud transactions make up only a tiny fraction of the rows, which can interfere with our machine learning algorithm. A model trained on this data as-is may not learn to differentiate between fraud and legit transactions.

One remedy is to build a dataset that contains the legit and fraud classes in equal numbers.
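
To see why the imbalance matters, consider a hypothetical "model" that labels every transaction as legit. The sketch below uses only the counts quoted earlier (492 frauds out of 284,807 transactions); the variable names are illustrative.

```python
n_fraud = 492
n_total = 284_807

# A trivial baseline that always predicts Class 0 (legit)
correct_predictions = n_total - n_fraud      # every legit row is "right"
baseline_accuracy = correct_predictions / n_total

# It never flags a fraud, so it catches none of them
frauds_caught = 0
fraud_recall = frauds_caught / n_fraud

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Share of frauds caught: {fraud_recall:.4f}")
```

The baseline scores very high accuracy while detecting no fraud at all, so accuracy on the raw dataset says little about how useful a model is.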

**Separating data for analysis**

In [None]:
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [None]:
print(legit.shape)
print(fraud.shape)

**Statistical measures of data**

In [None]:
legit.Amount.describe()

In [None]:
fraud.Amount.describe()

* **Finding:**

Compare the two means: fraud transactions have a higher mean amount than legit transactions.

**Comparing mean values for both transaction classes**

In [None]:
credit_card_data.groupby('Class').mean()

**Under-Sampling the Dataset**

Build a sample dataset containing equal numbers of legit and fraud transactions.

Number of fraud transactions: 492

In [None]:
# sample() picks random rows from the legit data

legit_sample = legit.sample(n=492)
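
Because `sample` draws rows at random, each run of the cell above selects different legit transactions. A possible variation, shown here on a hypothetical toy DataFrame rather than the real data, is to pass `random_state` for reproducibility and to derive `n` from the size of the fraud subset instead of hard-coding it. The seed value 42 is arbitrary.

```python
import pandas as pd

# Toy stand-in for the real data: 95 legit rows (Class 0), 5 fraud rows (Class 1)
toy_data = pd.DataFrame({
    "Amount": range(100),
    "Class": [0] * 95 + [1] * 5,
})
toy_legit = toy_data[toy_data["Class"] == 0]
toy_fraud = toy_data[toy_data["Class"] == 1]

# Draw as many legit rows as there are fraud rows, with a fixed seed
toy_legit_sample = toy_legit.sample(n=len(toy_fraud), random_state=42)

# Stack the two subsets into one balanced dataset
toy_balanced = pd.concat([toy_legit_sample, toy_fraud], axis=0)
print(toy_balanced["Class"].value_counts())
```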

In [None]:
legit_sample.head()

**Merging 2 Dataframes**

In [None]:
# Build a new dataset with the same number of legit and fraud rows

new_dataset = pd.concat([legit_sample,fraud], axis = 0)

In [None]:
new_dataset.head()

In [None]:
new_dataset.tail()

In [None]:
new_dataset.info()

In [None]:
new_dataset.Class.value_counts()

* **Findings:**

The new dataset is balanced: it contains the same number of legit and fraud transactions, so the model is no longer dominated by the majority class and has a fair chance of learning to tell the two apart.

In [None]:
new_dataset.groupby('Class').mean()

* **Findings:**

The per-class means are close to those computed on the full dataset, which suggests that the random sample of legit transactions is representative.

**Splitting the data into Features and Targets**

In [None]:
X = new_dataset.drop(columns='Class')   # Features: every column except the label
Y = new_dataset['Class']                # Target: 0 = legit, 1 = fraud

In [None]:
X

In [None]:
Y

# **4. Splitting the data into train and test data**

In [None]:
# 20% of the data will be taken for the Testing and 80% of the data will be taken for the Training

X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)
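
The `stratify=Y` argument asks `train_test_split` to keep the class proportions the same in both subsets. The sketch below illustrates this on hypothetical toy arrays (`X_toy`, `y_toy`), not on the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, of which 20 belong to the positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# Share of the positive class overall, in the train split, and in the test split
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

On the balanced dataset used in this notebook, the same mechanism keeps the training and test sets close to an even split between legit and fraud rows.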

# **5. Model Training**

### **Using Logistic Regression**

In [None]:
# Instantiate the logistic regression model

model = LogisticRegression()

**Training the Logistic Regression model with training data**

In [None]:
# Fit the model on the training data

model.fit(X_train,Y_train)
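
`Time` and `Amount` are on a much larger scale than `V1` to `V28`, and logistic regression solvers may converge slowly or emit a convergence warning on unscaled features. One common option, which is not part of the original notebook, is to standardise the features inside a pipeline and allow more iterations. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `scaled_model`); the `max_iter=1000` value is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one small-scale feature and one large-scale feature
n_rows = 400
small = rng.normal(size=n_rows)
large = rng.normal(loc=90_000, scale=50_000, size=n_rows)
X_demo = np.column_stack([small, large])

# Label depends on both features once they are put on a comparable scale
y_demo = (small + (large - 90_000) / 50_000 > 0).astype(int)

# Standardise each feature, then fit logistic regression
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

print("Training accuracy on synthetic data:", scaled_model.score(X_demo, y_demo))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the scaling step.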


# **6. Model Evaluation**

### **Accuracy Score**

In [None]:
# Accuracy on training data

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # (y_true, y_pred)

In [None]:
# Accuracy on test data

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # (y_true, y_pred)

In [None]:
print('Accuracy on Training Data: ',training_data_accuracy)
print('Accuracy on Test Data: ',test_data_accuracy)

**Finding:**

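* Accuracy on the training data is above 90%, which is fairly good for such a simple model.
* Accuracy on the test data is close to 90%. The small gap between the two scores suggests the model is not badly overfitting.
* Both scores are measured on the balanced, under-sampled dataset, not on the original class distribution.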
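
Accuracy alone does not show how many frauds are missed or how many legit transactions are wrongly flagged. A confusion matrix, precision, and recall give that breakdown. The sketch below uses made-up labels and predictions purely for illustration; it is not output from the model in this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = fraud, 0 = legit)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the transactions flagged as fraud, how many really were
precision = precision_score(y_true, y_pred)

# Recall: of the real frauds, how many were flagged
recall = recall_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision:.2f} Recall={recall:.2f}")
```

The same calls could be applied to `Y_test` and `X_test_prediction` from the cells above.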

## **Conclusion:**

This model can be used to check whether a transaction is legit or fraudulent, provided that the features and other transaction information are unbiased and have not been manipulated.