# Go to this [link](https://colab.research.google.com/drive/1UqbekDSqdx4rAU16bKWTmCGlVWUbyrI2) for the Colab file to see the results. GitHub has a file limit of 25 MB, and with the results this file is 29 MB

---



# Acknowledgment and Source of Data

The dataset contains a subset of the transactions made with credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 8000 transactions. The dataset is highly imbalanced.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, in example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
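
To illustrate what example-dependent cost-sensitive learning with 'Amount' could mean, here is a minimal sketch. It assumes one common cost scheme: a missed fraud costs the transaction amount, and every flagged transaction costs a fixed review fee. The function `total_cost`, the parameter `admin_cost`, and the toy arrays are hypothetical and not part of this notebook.

```python
import numpy as np

def total_cost(y_true, y_pred, amount, admin_cost=5.0):
    """Sum the cost of a set of predictions (hypothetical cost scheme).

    Assumption: a missed fraud costs its 'Amount'; every transaction
    flagged as fraud costs a fixed review fee `admin_cost`.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amount = np.asarray(amount, dtype=float)

    missed = (y_true == 1) & (y_pred == 0)   # frauds the model let through
    flagged = y_pred == 1                    # transactions sent for review
    return float(amount[missed].sum() + admin_cost * flagged.sum())

# Toy data, not taken from creditcard.csv
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1])
amount = np.array([10.0, 250.0, 40.0, 15.0])

cost = total_cost(y_true, y_pred, amount)
print(cost)
```

Under this scheme a model is judged by the money lost rather than by the raw error count, so missing one large fraud can outweigh several false alarms.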

# Data Loading

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!gdown "140Q-IkU5PtXAsSC9DmUmmZYWJdtVtWHq"
df=pd.read_csv("creditcard.csv")
df.head()

In [None]:
df.shape

In [None]:
df.info()         #3 are object and rest are float

In [None]:
df.describe() #5 point summary for numerical columns

In [None]:
for col in df.columns:
  print(col,df[col].nunique())      #class are 2 rest are many unique values

In [None]:
df.columns

# Data cleaning and formatting

In [None]:
df.isna().sum()     #No null values

In [None]:
df[df.duplicated()].head()

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

# EDA-Univariate analysis

In [None]:
#count of class to see if there is class imbalance

df['Class'].value_counts(normalize=True)

# There is a clear class imbalance (~6% fraud). We will have to use sampling techniques to balance the classes before training our models.
# proportion
# Class
# 0	0.940734
# 1	0.059266
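
One of the sampling techniques mentioned in the comment above is random undersampling of the majority class. The following is a minimal sketch using pandas only; the helper `undersample` and the toy frame are hypothetical, and the notebook may well use a different technique (oversampling, SMOTE, class weights) later on.

```python
import numpy as np
import pandas as pd

def undersample(frame, target="Class", random_state=42):
    """Randomly drop rows so every class has as many rows as the rarest one."""
    n_min = int(frame[target].value_counts().min())
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )

# Toy frame with roughly the same imbalance as the data (6% positives)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1000),
    "Class": [1] * 60 + [0] * 940,
})

balanced = undersample(toy)
counts = balanced["Class"].value_counts()
print(counts)
```

Resampling is usually applied to the training split only, so that the test split keeps the original class proportions.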



In [None]:
plt.figure(figsize=(3,3))
sns.countplot(x=df['Class'])
plt.show()


In [None]:
for col in df.columns:
  fig, ax =plt.subplots(1,2,figsize=(10,5))
  sns.histplot(data=df,x=col,ax=ax[0],kde=True)
  sns.boxplot(data=df,x=col,ax=ax[1])
  plt.show()

In [None]:
for col in df.columns:
  print(f"Column name: {col}")
  print("skewness",df[col].skew())
  print("kurtosis",df[col].kurtosis())
  print("-------------------------------")

# Bivariate Analysis

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call only adds an empty figure
sns.pairplot(df, kind='scatter', hue='Class')
plt.show()

In [None]:
corr_matrix = df.corr()

plt.figure(figsize=(30,30))
sns.heatmap(corr_matrix, annot=True, cmap='rocket', fmt=".2f")
plt.title("Correlation Matrix of Credit Card Transaction Data")
plt.tight_layout()
plt.show()

-----------------------------------------------------------------------------------------------
