In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn import preprocessing

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from matplotlib import pyplot as plt

import seaborn as sns

In [None]:
df = pd.read_csv('/content/sample_data/credit card.csv')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14595 entries, 0 to 14594
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    14595 non-null  int64  
 1   V1      14595 non-null  float64
 2   V2      14595 non-null  float64
 3   V3      14595 non-null  float64
 4   V4      14595 non-null  float64
 5   V5      14595 non-null  float64
 6   V6      14595 non-null  float64
 7   V7      14595 non-null  float64
 8   V8      14595 non-null  float64
 9   V9      14595 non-null  float64
 10  V10     14595 non-null  float64
 11  V11     14595 non-null  float64
 12  V12     14595 non-null  float64
 13  V13     14595 non-null  float64
 14  V14     14595 non-null  float64
 15  V15     14595 non-null  float64
 16  V16     14595 non-null  float64
 17  V17     14595 non-null  float64
 18  V18     14595 non-null  float64
 19  V19     14595 non-null  float64
 20  V20     14595 non-null  float64
 21  V21     14594 non-null  float64
 22

The dataset contains 31 columns. Class is the target variable, and all the other columns are features. The main feature columns are named V1 to V28.

In [None]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


Preprocessing is an important part of the analysis: it is where outliers and duplicates are removed from the dataset. It is also common practice to standardize the feature columns, which helps the optimizer converge faster and often gives better results.

In [None]:
sum(df.duplicated())

53

There are 53 duplicate rows that should be removed from the dataset. They can be dropped with the line of code below:

In [None]:
df.drop_duplicates(inplace=True)
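
To make the behaviour of `duplicated()` and `drop_duplicates()` concrete, here is a minimal sketch on a small made-up frame. The frame `toy` and its values are hypothetical and are not taken from the credit card data.

```python
import pandas as pd

# Toy frame (hypothetical values, not the credit card dataset)
toy = pd.DataFrame({
    "V1": [0.1, 0.1, 0.5, 0.1],
    "Amount": [10.0, 10.0, 3.5, 10.0],
    "Class": [0, 0, 1, 0],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row and returns a new frame;
# reset_index(drop=True) renumbers the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Returning a new frame, as done here, is an alternative to `inplace=True` that leaves the original data available for comparison.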

Besides rows, some columns carry no meaningful information for the classification and should be removed before training the model. One such column in our dataset is the Time column. It can be removed using the line of code given below:

In [None]:
df.drop('Time', axis=1, inplace=True)
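
The `df.info()` output earlier lists V21 with 14594 non-null values out of 14595 rows, so at least one value is missing. scikit-learn's `LogisticRegression` does not accept NaN inputs, so it may be worth checking for missing values at this stage. The sketch below uses a small hypothetical frame; whether dropping or imputing is the better choice for the real data is an assumption left to the reader.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical data)
toy_missing = pd.DataFrame({
    "V21": [0.2, -0.1, np.nan],
    "Amount": [5.0, 7.5, 1.0],
    "Class": [0.0, 1.0, 0.0],
})

# Count missing values per column
missing_per_column = toy_missing.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_cleaned = toy_missing.dropna()
print(len(toy_cleaned))
```

On the notebook's data, the equivalent check would be `df.isna().sum()`.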

After the data has been cleaned, the columns can be separated into feature columns and the target column. As mentioned before, the Class column is the target and everything else is a feature:

In [None]:
X = df.drop(columns='Class')  # every column except the target
y = df['Class']               # target column

Having done that, the dataset can be divided into training and test sets. The training set is used to train the classifier, while the test set can be used to evaluate the performance of the classifier on unseen instances.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5, stratify=y)
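
The `stratify` argument keeps the class proportions the same in both splits, which matters when one class is rare. A minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 negatives and 10 positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

# Same call shape as in the notebook, applied to the toy arrays
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=5, stratify=y_toy
)

# With stratification, both splits keep the 10% positive rate
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set could end up with very few or no positive cases, which would make the later evaluation unreliable.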

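Before supplying the data to the classifier, the features are scaled using a standard scaler (as mentioned before). The scaler is fitted on the training set only, so that no information from the test set leaks into training. It is done using the code below: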

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
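
The following sketch illustrates what `StandardScaler` does and why the same fitted scaler should be reused for the test data. The arrays and the name `toy_scaler` are hypothetical, chosen so the numbers are easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

# Learn the per-column mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(toy_train)

# Apply the same training statistics to both sets
train_scaled = toy_scaler.transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

The scaled test values are not guaranteed to have zero mean or unit variance, because they are expressed relative to the training statistics.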

Building the Logistic Regression Model

In [None]:
model = LogisticRegression()

The model can be trained by passing train set features and their corresponding target class values. The model will use that to learn to classify unseen examples.

In [None]:
model.fit(X_train_scaled, y_train)
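
As an alternative way to organize the scaling and fitting steps, scikit-learn's `make_pipeline` can chain the scaler and the classifier into one estimator. This is a sketch on synthetic data, not the notebook's method; the names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data: two well-separated Gaussian clusters
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y_demo = np.array([0] * 50 + [1] * 50)

# The pipeline fits the scaler and then the classifier in one call
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_demo, y_demo)

# predict() applies the fitted scaler automatically before classifying
demo_pred = clf.predict(np.array([[-2.0, -2.0], [2.0, 2.0]]))
print(demo_pred)
```

With a pipeline, the test features are scaled with the training statistics automatically, so a separate `transform` call on the test set cannot be forgotten.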

Let’s first evaluate the model on the training set and see the results:

In [None]:
train_acc = model.score(X_train_scaled, y_train)
print("The Accuracy for Training Set is {}".format(train_acc*100))

The training accuracy is over 99.9%, which looks good. However, training accuracy only shows how well the model fits data it has already seen; test accuracy is the more meaningful measure of success.

Checking the performance on the test set.

In [None]:
y_pred = model.predict(scaler.transform(X_test))  # scale the test set with the training scaler, then predict
test_acc = accuracy_score(y_test, y_pred)
print("The Accuracy for Test Set is {}".format(test_acc*100))

The test accuracy is also over 99.9%, which is great.

Note that most problems will not yield accuracy this high. Part of the reason it is so high here is that the classes are heavily imbalanced, so a classifier can score well on accuracy simply by predicting the majority class most of the time.

Because this data is imbalanced (there are very few cases where y = 1), the classification report gives more information than a simple accuracy measure. It reports precision and recall for each class as well.

In [None]:
y_pred = model.predict(scaler.transform(X_test))  # predictions on the scaled test set
print(classification_report(y_test, y_pred))
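
To see why accuracy alone can mislead on imbalanced data, consider a hypothetical baseline that always predicts the majority class. The labels below are made up for illustration and are not results from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 98 negatives and 2 positives
y_true_toy = [0] * 98 + [1] * 2

# A baseline that never predicts the positive class
y_majority = [0] * 100

acc = accuracy_score(y_true_toy, y_majority)   # fraction of correct predictions
rec = recall_score(y_true_toy, y_majority)     # fraction of positives that were found
cm = confusion_matrix(y_true_toy, y_majority)  # rows: true class, columns: predicted class

print(acc)
print(rec)
print(cm)
```

The baseline reaches 98% accuracy while finding none of the positive cases, which is why recall and the confusion matrix are worth checking alongside accuracy.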