# **1-Introduction**

Fraudulent transactions are a significant issue in many industries, such as banking and insurance. For the banking industry in particular, credit card fraud detection is a pressing problem.

These industries lose revenue to fraud and, with it, their customers' trust. Companies therefore need to detect fraudulent transactions before the losses become a serious problem.

Unlike many other machine learning problems, credit card fraud detection has a heavily skewed target class distribution: fraudulent transactions are far rarer than legitimate ones. This is commonly known as the class imbalance problem or the unbalanced data issue.

# **2-Why do we need to find fraud transactions?**

For many companies, fraud detection is a major problem because they discover fraudulent activity only after they have already suffered heavy losses.

Fraud occurs in every industry; no single company or sector is the only one affected.

For financial companies, however, fraudulent transactions are an especially serious problem, so these companies want to detect them before they cause significant damage.

Even with today's technology, about 13% of credit card transactions are fraudulent, according to figures reported by the creditcards website.

A survey paper reported that in 1997, 63% of companies had experienced fraud in the preceding two years, and that in 1999, 57% had experienced at least one fraud in the preceding year.

The point is not only that fraud is increasing, but also that the methods used to commit it keep multiplying.

Detecting fraud is hard, and companies worldwide lose billions of dollars to it every year.

Customer trust is also essential for any company that wants to establish itself in the marketplace. A company that cannot detect fraudulent activity loses that trust and then suffers from customer churn.

# **3-Fraud Detection Approaches**

First, companies hired small teams dedicated to detecting fraudulent transactions. The team members must be experts in the field and must understand how fraud occurs in their particular domain. This approach is costly in both people and time.

Second, companies moved from manual processes to rule-based solutions, but these also fail to detect fraud much of the time.

The reason is that fraud methods in the real world change constantly, while a rule-based system only follows a fixed set of rules and conditions. If a new kind of fraud does not match any existing rule, the system misses it until someone writes a new rule and deploys it.
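
To make this limitation concrete, here is a minimal sketch of a rule-based check. The function name, the transaction fields, and the thresholds are all invented for illustration; they do not come from any real fraud system.

```python
# Hypothetical rule-based fraud check; fields and thresholds are illustrative only.
def is_suspicious(transaction):
    """Return True if any hand-written rule flags the transaction."""
    rules = [
        lambda t: t["amount"] > 5000,                  # unusually large purchase
        lambda t: t["country"] != t["home_country"],   # purchase made abroad
    ]
    return any(rule(transaction) for rule in rules)


# A pattern the rules were written for: one very large purchase.
known_pattern = {"amount": 9000, "country": "US", "home_country": "US"}

# A pattern nobody wrote a rule for: a small domestic charge
# (imagine it is one of many repeated small charges).
new_pattern = {"amount": 40, "country": "US", "home_country": "US"}

flag_known = is_suspicious(known_pattern)
flag_new = is_suspicious(new_pattern)
print(flag_known, flag_new)
```

The second transaction passes unnoticed because no rule describes it, and it will keep passing until someone adds a rule by hand.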

Now companies are adopting artificial intelligence and machine learning algorithms to detect fraud. Machine learning algorithms perform very well on this type of problem.

# **4-What is Credit Card Fraud Detection?**

The previous sections discussed why fraudulent activity needs to be identified. Credit card fraud detection is a classification problem: the goal is to flag fraudulent transactions before they become a major problem for credit card companies.

A model learns from historical credit card transactions of many different cardholders, each labeled as fraud or non-fraud, and uses what it learns to estimate whether a new transaction is fraudulent.

In this article, we use a popular credit card dataset. Let's understand the data before we start building fraud detection models.

# **5-Understanding of Credit Card Dataset**

For this credit card fraud classification problem, we use a dataset that can be downloaded from the Kaggle platform.

Before moving on to model development, we should know a few things about the dataset, such as:

- What is the size of the dataset?
- How many features does the dataset have?
- What are the target values?
- How many samples are there under each target value?

Knowing these answers helps us decide what to do next.

We can answer all of these questions with the Python pandas library.

Let's jump to the data exploration part to find the answers.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **6-Data Explorations**

First, we need to load the dataset. After downloading it, extract the archive and place the CSV file in a dataset folder inside the project folder.

We can quickly load it using pandas.

In [3]:
import pandas as pd
# load dataset
fraud_df = pd.read_csv("/content/drive/My Drive/Datasets/Credit Card Detection/creditcard.csv")

Our dataset is a CSV (comma-separated values) file, so we can use the read_csv function from pandas to read it.
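
The loading cell needs the file on Google Drive. As a minimal sketch that runs without it, the same read_csv function can parse a small in-memory CSV. The three rows below are made-up stand-ins that only borrow a few column names from the dataset; they are not real records.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for creditcard.csv (values are made up).
csv_text = (
    "Time,V1,Amount,Class\n"
    "0,-1.36,149.62,0\n"
    "1,1.19,2.69,0\n"
    "2,-2.31,239.93,1\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names; types are inferred per column.
print(sample_df.dtypes)
```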

Now let's answer the dataset questions listed above.

In [None]:
fraud_df.shape

The dataset has 284,807 rows and 31 columns: 30 features plus the target column. The shape attribute returns a tuple containing the number of rows and the number of columns.
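
Because shape is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up DataFrame (toy_df is a hypothetical name) in place of the real data.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
toy_df = pd.DataFrame(
    {
        "V1": [0.1, 0.2, 0.3],
        "Amount": [10.0, 20.0, 30.0],
        "Class": [0, 0, 1],
    }
)

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy_df.shape

# Every column except the target counts as a feature.
n_features = n_cols - 1
print(n_rows, n_cols, n_features)
```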

Let's see what the dataset looks like. The command below shows only five rows, because head() returns 5 samples by default.

In [None]:
fraud_df.head()

If you want to see more samples from the top, pass the number of samples you want, as in fraud_df.head(10).

You can also see samples from the bottom with the tail() function. Both work the same way.

In [None]:
fraud_df.tail()

We can also get the list of all column names.

In [None]:
fraud_df.columns

From this, we know that Class is the target variable and all the remaining columns are features.
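
As a minimal sketch of what that distinction means in code, the columns can be split into a feature table and a target series. The names toy_df, X, and y are assumptions made for this example, and the data is made up.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
toy_df = pd.DataFrame(
    {
        "V1": [0.1, 0.2, 0.3],
        "Amount": [10.0, 20.0, 30.0],
        "Class": [0, 0, 1],
    }
)

target_col = "Class"

# Everything that is not the target is treated as a feature.
feature_cols = [col for col in toy_df.columns if col != target_col]

X = toy_df[feature_cols]  # feature table
y = toy_df[target_col]    # target labels
print(feature_cols)
```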

Let's see which unique values the target variable has.

In [None]:
fraud_df['Class'].unique()

The target variable Class has the values 0 and 1, where:

- 0 for non-fraudulent transactions
- 1 for fraudulent transactions

Because our aim is to find fraudulent transactions, fraud is treated as the positive class and is labeled 1.

One data exploration question is still open: how many samples does each target class have?

In [None]:
fraud_df['Class'].value_counts()

We have 284,315 non-fraudulent transaction samples and 492 fraudulent transaction samples.
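
A short calculation shows how lopsided these counts are. It uses only the two numbers reported by value_counts(); the variable names are chosen for this example.

```python
# Class counts as reported by value_counts() on the dataset.
non_fraud = 284315
fraud = 492

total = non_fraud + fraud

# Share of all transactions that are fraudulent.
fraud_share = fraud / total

# Number of legitimate transactions per fraudulent one.
imbalance_ratio = non_fraud / fraud

print(f"fraud share: {fraud_share:.4%}")
print(f"non-fraud per fraud: {imbalance_ratio:.0f}")
```

Fewer than 2 in every 1,000 transactions are fraudulent. On the real data, fraud_df['Class'].value_counts(normalize=True) returns the same shares directly.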

We will discuss the data further in later sections of this article.

There you will see how this difference in sample counts affects a model's performance and how to evaluate model performance on such data.
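
As a preview of why evaluation needs care, consider a hypothetical "model" that labels every transaction as non-fraudulent. The sketch below rebuilds the labels from the class counts alone and measures that baseline in plain Python; it is an illustration, not a model used in this article.

```python
# Rebuild the true labels from the class counts (0 = non-fraud, 1 = fraud).
y_true = [0] * 284315 + [1] * 492

# Baseline that never predicts fraud.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall: fraction of actual frauds that were caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy: {accuracy:.4f}, recall: {recall:.1f}")
```

The baseline is right more than 99.8% of the time and still catches no fraud at all, which is why accuracy alone is a misleading measure on data this imbalanced.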

So far, we know the following about the dataset:

- Dataset size
- Number of samples (rows) and features (columns)
- Names of the features
- The target variable and its values

Next, we will discuss data preprocessing techniques for our dataset.

These techniques are completely different from the text preprocessing techniques discussed in the article on natural language processing data preprocessing techniques.

# **7-Credit Card Data Preprocessing**

Preprocessing is the process of cleaning the dataset. In this step, we apply different methods to clean the raw data so that the modeling phase receives more meaningful data. These methods include:

- Removing duplicate or irrelevant samples
- Replacing missing values with the most relevant values
- Converting one data type to another, for example categorical values to integers
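
The three methods in the list can be sketched on a small made-up table. The Channel column is hypothetical and added only to show a categorical conversion, and using the median as the "most relevant" fill value is one possible choice rather than the only one.

```python
import pandas as pd

# Made-up raw data with one duplicate row and one missing amount.
raw_df = pd.DataFrame(
    {
        "Amount": [10.0, 10.0, None, 250.0],
        "Channel": ["online", "online", "store", "store"],  # hypothetical column
        "Class": [0, 0, 0, 1],
    }
)

# 1. Remove exact duplicate rows (the second row repeats the first).
clean_df = raw_df.drop_duplicates().copy()

# 2. Fill the missing amount with the median of the known amounts.
clean_df["Amount"] = clean_df["Amount"].fillna(clean_df["Amount"].median())

# 3. Convert the categorical column to integer codes.
clean_df["Channel"] = clean_df["Channel"].astype("category").cat.codes

print(clean_df)
```

On the real data, fraud_df.duplicated().sum() and fraud_df.isnull().sum() are quick ways to check whether the first two steps are needed at all.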

Now we will spend a couple of minutes checking the dataset and applying the corresponding techniques to clean it.

This step aims to improve the quality of the data.

**Removing irrelevant columns/features**

In our dataset, the only irrelevant (not useful) feature is Time, so we can drop it from the dataset.

In [None]:
# make sure which features are useful & which are not
# we can remove irrelevant features
fraud_df = fraud_df.drop(['Time'], axis=1)
fraud_df.columns
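
One practical detail: running the drop cell a second time raises a KeyError, because Time is already gone. The sketch below, on a made-up table, shows that drop returns a new DataFrame and that passing errors="ignore" makes the step safe to repeat. The names toy_df, reduced_df, and reduced_again are assumptions for this example.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
toy_df = pd.DataFrame(
    {
        "Time": [0, 1],
        "V1": [-1.36, 1.19],
        "Amount": [149.62, 2.69],
        "Class": [0, 0],
    }
)

# drop returns a new DataFrame; toy_df itself still has the Time column.
reduced_df = toy_df.drop(columns=["Time"])

# errors="ignore" skips labels that are not present,
# so repeating the step does not raise a KeyError.
reduced_again = reduced_df.drop(columns=["Time"], errors="ignore")

print(list(reduced_df.columns))
```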

# **References**
[1] [Credit Card Fraud Detection](https://dataaspirant.com/credit-card-fraud-detection-classification-algorithms-python/)