# Credit Card Fraud Detection Dataset - Data Cleaning

We start by importing the Python libraries we will use to clean and process the data: Pandas and NumPy.

In [2]:
import pandas as pd
import numpy as np

We load the Credit Card Fraud Detection Dataset into a Pandas DataFrame, which lets us manipulate and clean the data with the tools Pandas provides.

In [3]:
df = pd.read_csv('creditcard.csv')

Missing values are a common issue in datasets and can cause problems for our analysis. We count them per column by chaining the Pandas isnull() and sum() methods, and find that there are none in this particular dataset.

In [4]:
# Check for missing values
print("Missing values before cleaning: ")
print(df.isnull().sum())

Missing values before cleaning: 
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
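
With many columns, the per-column listing can be condensed into a single total and a list of affected columns. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 25.5],
    "Class": [0, 0, 1],
})

# Per-column counts of missing values, as in the cell above.
per_column = toy.isnull().sum()

# Total number of missing cells across the whole frame.
total_missing = int(per_column.sum())

# Names of the columns that contain at least one missing value.
columns_with_missing = per_column[per_column > 0].index.tolist()

print(total_missing)
print(columns_with_missing)
```

A total of zero confirms in one line what the full listing shows column by column.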


Duplicate rows can also distort the analysis. This notebook does not count them first, so we drop any that exist with drop_duplicates() and then check for missing values again.

In [5]:
# Remove duplicates
df.drop_duplicates(inplace=True)
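
It can be useful to know how many rows are actually duplicated before dropping them. The sketch below counts them with `duplicated()` on a small made-up frame (`toy` is hypothetical); on the real data the same calls would run against `df` before the cell above.

```python
import pandas as pd

# Hypothetical toy frame; the first two rows are identical.
toy = pd.DataFrame({
    "Time": [0, 0, 1, 1],
    "Amount": [5.0, 5.0, 7.5, 9.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
rows_removed = len(toy) - len(deduped)

print(n_duplicates, rows_removed)
```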

In [6]:
# Check for missing values again
print("Missing values after removing duplicates: ")
print(df.isnull().sum())

Missing values after removing duplicates: 
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64


The 'Time' column contains the number of seconds elapsed between each transaction and the first transaction in the dataset. Using 'to_datetime()' with unit='s', we convert these values to a datetime format that Pandas can recognize and work with, which makes it easier to derive time-related features such as the hour of the day and to explore patterns and trends over time. Note that the values are elapsed seconds rather than real timestamps, so the converted dates are anchored at 1970-01-01 (the Unix epoch). Features such as day of the week or hour of the day are therefore relative to the first transaction, not true calendar values.

In [7]:
# Convert the 'Time' column to datetime format
df['Time'] = pd.to_datetime(df['Time'], unit='s')
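
To illustrate what this conversion does to elapsed seconds, here is a small sketch on made-up values (the `seconds` series is hypothetical). It also shows `pd.to_timedelta` as one possible alternative that keeps the values as durations; this is an assumption about what may suit the data, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical elapsed times: 0 seconds, 1 hour, and 25 hours.
seconds = pd.Series([0, 3600, 90000])

# to_datetime with unit="s" counts from the Unix epoch, 1970-01-01.
as_datetime = pd.to_datetime(seconds, unit="s")

# to_timedelta keeps the values as durations since the first transaction.
as_timedelta = pd.to_timedelta(seconds, unit="s")

# Hours elapsed since the first transaction.
hours_elapsed = as_timedelta.dt.total_seconds() / 3600

# Position within each 24-hour window, counted from the first transaction.
hour_in_window = (seconds // 3600) % 24

print(as_datetime)
print(hours_elapsed.tolist())
print(hour_in_window.tolist())
```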

After converting the 'Time' column, we print the data type of each column using the dtypes attribute of the DataFrame. This confirms that 'Time' was converted to datetime format and shows whether any other columns need to be converted to a different data type for analysis.

In [8]:
# Check the data types of each column
print("Data types of each column: ")
print(df.dtypes)

Data types of each column: 
Time      datetime64[ns]
V1               float64
V2               float64
V3               float64
V4               float64
V5               float64
V6               float64
V7               float64
V8               float64
V9               float64
V10              float64
V11              float64
V12              float64
V13              float64
V14              float64
V15              float64
V16              float64
V17              float64
V18              float64
V19              float64
V20              float64
V21              float64
V22              float64
V23              float64
V24              float64
V25              float64
V26              float64
V27              float64
V28              float64
Amount           float64
Class              int64
dtype: object
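
The same check can be done programmatically instead of by reading the listing. The sketch below uses the helpers in `pandas.api.types` on a small made-up frame (`toy` is hypothetical and only mirrors a few of the dataset's column names).

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Hypothetical toy frame with the same kinds of columns as df.
toy = pd.DataFrame({
    "Time": pd.to_datetime([0, 1, 2], unit="s"),
    "Amount": [1.0, 2.5, 3.0],
    "Class": [0, 0, 1],
})

# True if the 'Time' column holds datetime values.
time_is_datetime = is_datetime64_any_dtype(toy["Time"])

# Any remaining column that is not numeric would need attention.
non_numeric = [
    col for col in toy.columns
    if col != "Time" and not is_numeric_dtype(toy[col])
]

print(time_is_datetime, non_numeric)
```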


Outliers are data points that are significantly different from the other data points in a dataset. In the case of the 'Amount' column, outliers could be transactions with unusually large or small amounts. These outliers could be due to errors in the data or they could be legitimate transactions that are simply outside the usual range.

The code calculates the first and third quartiles of the 'Amount' column and, from them, the interquartile range (IQR). The lower bound is the first quartile minus 1.5 times the IQR, and the upper bound is the third quartile plus 1.5 times the IQR. Any data points that fall outside these bounds are considered outliers and are printed out.

Checking for outliers is an important step in data cleaning because they can affect the analysis or any models trained on the data. Note that the code below only identifies and prints the outliers; it does not remove them. Removing outliers can sometimes improve accuracy, but in a fraud detection setting unusually large amounts may carry useful signal, so that decision deserves care.

In [12]:
# Check for outliers in the 'Amount' column
q1 = df['Amount'].quantile(0.25)
q3 = df['Amount'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
outliers = df[(df['Amount'] < lower_bound) | (df['Amount'] > upper_bound)]
print(outliers)

                      Time        V1        V2        V3        V4        V5  \
2      1970-01-01 00:00:01 -1.358354 -1.340163  1.773209  0.379780 -0.503198   
20     1970-01-01 00:00:16  0.694885 -1.361819  1.029221  0.834159 -1.191209   
51     1970-01-01 00:00:36 -1.004929 -0.985978 -0.038039  3.710061 -6.631951   
64     1970-01-01 00:00:42 -0.522666  1.009923  0.276470  1.475289 -0.707013   
85     1970-01-01 00:00:55 -4.575093 -4.429184  3.402585  0.903915  3.002224   
...                    ...       ...       ...       ...       ...       ...   
284735 1970-01-02 23:58:47 -1.661169 -0.565425  0.294268 -1.549156 -2.301359   
284748 1970-01-02 23:58:58  1.634178 -0.486939 -1.975967  0.495364  0.263635   
284753 1970-01-02 23:59:03  1.465737 -0.618047 -2.851391  1.425282  0.893893   
284757 1970-01-02 23:59:05 -1.757643 -0.982659  1.091540 -1.409539 -0.662159   
284806 1970-01-02 23:59:52 -0.533413 -0.189733  0.703337 -0.506271 -0.012546   

              V6        V7        V8   
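
The IQR rule used above can be checked by hand on a few values. The following is a minimal sketch with made-up amounts; the helper name `iqr_bounds` and its parameter `k` are hypothetical and introduced only for illustration.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds at k times the IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical amounts: four small values and one very large one.
amounts = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower, upper = iqr_bounds(amounts)

# Boolean mask of the values that fall outside the bounds.
is_outlier = (amounts < lower) | (amounts > upper)

print(lower, upper)
print(int(is_outlier.sum()), "outlier(s):", amounts[is_outlier].tolist())
```

Here the quartiles are 2.0 and 4.0, so the IQR is 2.0 and the bounds are -1.0 and 7.0; only 100.0 falls outside them. Printing a count rather than the full frame also keeps the output short on a dataset of this size.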

Finally, we save the cleaned data to a new CSV file. This is good practice after data cleaning, as it keeps a separate record of the cleaned data that is easy to load for further analysis or modeling.

In [13]:
# Save the cleaned data to a new CSV file
df.to_csv('cleaned_creditcard.csv', index=False)
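
One thing to keep in mind when reloading the file: CSV stores the 'Time' column as text, so it has to be parsed again on read. The sketch below demonstrates this with an in-memory buffer and a made-up frame (`toy` and `buffer` are hypothetical stand-ins for `df` and the file on disk).

```python
import io

import pandas as pd

# Hypothetical toy frame with a datetime 'Time' column.
toy = pd.DataFrame({
    "Time": pd.to_datetime([0, 60], unit="s"),
    "Amount": [9.99, 20.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without parse_dates leaves 'Time' as plain text.
buffer.seek(0)
plain = pd.read_csv(buffer)
plain_is_datetime = pd.api.types.is_datetime64_any_dtype(plain["Time"])

# Passing parse_dates restores the datetime type.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Time"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["Time"])

print(plain_is_datetime, parsed_is_datetime)
```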