Hello my friends. After almost 2 years of experience in data science, I am finally making my first post!
The credit fraud dataset is a really nice exercise to start my journey on Kaggle. It's easy and clean, with not much effort needed since the components are already PCA-ed.

For this exercise I did a basic analysis of the dataset and fitted a few models with minimal preprocessing, little cross-validation trickery and no fancy boosting. The results looked too good to be true, probably thanks to the pre-done PCA. I intend to present my work as a showcase of my beginner's skill and would appreciate any kind of feedback on how I can improve.

Anyway, let's start.

We first import the required modules; you can already see where I am going.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
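from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; its helpers now
# live in sklearn.model_selection, imported here under the old name.
from sklearn import model_selection as cross_validation
from sklearn.cluster import KMeans
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier

Oops, I was using an old module: sklearn.cross_validation no longer exists in recent scikit-learn releases, and train_test_split now lives in sklearn.model_selection.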

Let's read the file and see what's in it.

In [None]:
df = pd.read_csv('../input/creditcard.csv')
df.head()

Scrolling to the right, we can see that the column names don't really mean much as a result of PCA. Class is either 0 or 1. Time starts from zero, which doesn't show much other than its sequential structure.
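Since Class is binary, a quick count of each label is worth doing before any modelling. Here is a minimal sketch on a made-up toy frame; the rows and the helper name class_balance are invented for illustration and are not taken from creditcard.csv.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration.
toy = pd.DataFrame({
    'Time':   [0, 1, 1, 2, 4, 7, 7, 9],
    'Amount': [149.62, 2.69, 378.66, 123.5, 69.99, 3.67, 4.99, 40.8],
    'Class':  [0, 0, 0, 0, 0, 0, 1, 0],
})

def class_balance(df, target='Class'):
    """Return the count and the share of each class label."""
    counts = df[target].value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({'count': counts, 'share': shares})

balance = class_balance(toy)
print(balance)
```

On the real data the same call would be class_balance(df).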

Let's go straight into exploratory analysis. We will start by inspecting whether the columns are really PCA-ed.

In [None]:
def plot_corr(df):
    #### Checking correlation of PCA variables
    corr = df.corr()
    # The builtin bool is used because the np.bool alias was removed in NumPy 1.24.
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    
    fig1, ax1 = plt.subplots()
    corrplot = sns.heatmap(corr, mask=mask, square=True, ax=ax1)
    plt.yticks(rotation=0)
    plt.xticks(rotation=90)

plot_corr(df)

We see that the PCA columns are not correlated with each other. Power of PCA.
Time, Amount and Class are weakly and quite randomly correlated with the others. Nothing looks suspicious here, so I will leave them as they are.
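To see why PCA output should show no correlation between components, here is a small sketch on synthetic data (two invented, strongly correlated features). It does PCA by hand via an SVD of the centred data, which is one standard way to compute it; nothing here comes from the credit card file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features (invented data).
a = rng.normal(size=1000)
b = 0.8 * a + 0.2 * rng.normal(size=1000)
raw = np.column_stack([a, b])

# PCA via SVD of the centred data: project onto the right singular vectors.
centred = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
components = centred @ vt.T

raw_corr = np.corrcoef(raw, rowvar=False)[0, 1]
pca_corr = np.corrcoef(components, rowvar=False)[0, 1]
print(f'correlation before PCA: {raw_corr:.3f}')
print(f'correlation after PCA:  {pca_corr:.3e}')
```

The projected columns are orthogonal by construction, which is why the heatmap is blank between the PCA columns.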

Let's look at the fraud incidents over time now.

In [None]:
def plot_time(df):
    ##### Checking the time series (fraud)
    df1 = df.loc[df['Class']==1, :]
    df2 = df.loc[df['Class']==0, :]
    
    fig2, ax2 = plt.subplots()
    x1 = df1['Time']
    y1 = df1['Amount']
    x2 = df2['Time']
    y2 = df2['Amount']
    ax2.scatter(x2, y2, alpha=0.2, color = 'blue')
    ax2.scatter(x1, y1, alpha=0.8, color='red')
    ax2.set_xlim((0, max(x2)))
    ax2.set_ylim((0, max(y2)))
    
plot_time(df)

Pretty random to me. I could have done a bit more and summed the counts or amounts per time bucket, but I didn't do it.
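The aggregation mentioned above could look something like the sketch below. It assumes Time is measured in seconds since the first transaction, as the dataset description on Kaggle states. The toy rows and the helper name fraud_by_hour are invented for illustration.

```python
import pandas as pd

# Invented toy rows; Time is assumed to be in seconds since the first row.
toy = pd.DataFrame({
    'Time':   [10, 3500, 3700, 7100, 7300, 7400],
    'Amount': [20.0, 5.0, 100.0, 1.0, 50.0, 250.0],
    'Class':  [0, 1, 0, 0, 1, 1],
})

def fraud_by_hour(df):
    """Count and total amount of fraudulent rows per elapsed hour."""
    fraud = df.loc[df['Class'] == 1].copy()
    fraud['Hour'] = (fraud['Time'] // 3600).astype(int)
    return fraud.groupby('Hour')['Amount'].agg(['count', 'sum'])

hourly = fraud_by_hour(toy)
print(hourly)
```

A bar chart of hourly['count'] would then show whether fraud clusters at particular hours.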

I wonder if the one at the top right corner corresponds to ...Lehman Bros? ...Madoff?  

Let's look at the distribution next.

In [None]:
def plot_dist(df):
    ##### Violin plot and boxplot to show the distribution
    fig3, (ax3_1, ax3_2) = plt.subplots(ncols=2, sharey=True)
    sns.violinplot(x=df['Class'], y=df['Amount'], ax=ax3_1)
    sns.boxplot(x=df['Class'], y=df['Amount'], ax=ax3_2)
    
plot_dist(df)

We can see that the outliers in class 0 heavily stretch the plot. Not much to see here.
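One common way to make a plot like this readable is to put the amounts on a log scale before plotting. This is a suggestion rather than something the notebook does, and the amounts below are invented to show the effect.

```python
import numpy as np

# Invented amounts with one extreme outlier.
amounts = np.array([1.0, 5.0, 12.5, 30.0, 80.0, 25000.0])

# log1p(x) = log(1 + x) handles zero amounts and compresses the long tail.
log_amounts = np.log1p(amounts)

raw_ratio = amounts.max() / np.median(amounts)
log_ratio = log_amounts.max() / np.median(log_amounts)
print(f'max/median on raw scale: {raw_ratio:.1f}')
print(f'max/median on log scale: {log_ratio:.1f}')
```

On the real frame, plotting np.log1p(df['Amount']) in place of df['Amount'] would be the equivalent change.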

Now that we have done our due diligence in charting, let's build some models and compare their accuracies. Here we are going to use KMeans, Logistic Regression and Random Forest. Three very different kinds of algos.

Let's show the entire script first and call the functions separately later.

In [None]:
def preproc(df):
    ##### Preprocessing
    # axis must be passed by keyword; the positional form was removed in pandas 2.0.
    X = np.array(df.drop(['Time', 'Class'], axis=1).astype(float))
    Y = np.array(df['Class'])
    
    X = preprocessing.scale(X)
    
    X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=0.3)

    return(X_train, X_test, Y_train, Y_test)

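def mod_predict(x, y, model):
    # Share of predictions that match the true labels.
    prediction = model.predict(x)
    acc = np.mean(prediction == y)
    # KMeans cluster ids are arbitrary, so cluster 0 may correspond to class 1.
    # Flipping an accuracy below 0.5 accounts for that label swap; for the
    # supervised models this branch is not expected to trigger.
    if acc <= 0.5:
        acc = 1 - acc
    return acc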

def KMs(df):
    X_train, X_test, Y_train, Y_test = preproc(df)
    
    clf = KMeans(n_clusters=2)
    clf.fit(X_train)
    
    in_sample_acc = mod_predict(X_train, Y_train, clf)
    out_sample_acc = mod_predict(X_test, Y_test, clf)
    
    print('In-Sample accuracy : ' + str(round(in_sample_acc, 4)) +'\n' + 'Out-Sample accuracy : ' + str(round(out_sample_acc, 4)))

def LogReg(df):
    X_train, X_test, Y_train, Y_test = preproc(df)
    
    glm = linear_model.LogisticRegression()
    glm.fit(X_train, Y_train)
    
    in_sample_acc = mod_predict(X_train, Y_train, glm)
    out_sample_acc = mod_predict(X_test, Y_test, glm)
    
    print('In-Sample accuracy : ' + str(round(in_sample_acc, 4)) +'\n' + 'Out-Sample accuracy : ' + str(round(out_sample_acc, 4)))
    
def randomforest(df):
    X_train, X_test, Y_train, Y_test = preproc(df)
    
    rf = RandomForestClassifier()
    rf.fit(X_train, Y_train)
    
    in_sample_acc = mod_predict(X_train, Y_train, rf)
    out_sample_acc = mod_predict(X_test, Y_test, rf)
    
    print('In-Sample accuracy : ' + str(round(in_sample_acc, 4)) +'\n' + 'Out-Sample accuracy : ' + str(round(out_sample_acc, 4)))
    
    imp = rf.feature_importances_
    fig4, ax4 = plt.subplots()
    # Recent seaborn releases require x and y to be passed by keyword.
    sns.barplot(x=list(range(len(imp))), y=imp, ax=ax4)
    fig4.suptitle('PCA feature importance')
    


Here, *preproc* does the preprocessing, i.e. the centering, scaling and train/test splitting. *mod_predict* takes one of the models, runs it on either the train or the test set and spits out an accuracy.
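One detail of preproc is worth a second look: it scales the whole dataset before splitting, so the test rows influence the mean and variance used for scaling. The sketch below shows a possible variant that splits first and fits the scaler on the training rows only. It is not the notebook's method: the function name preproc_no_leak, the seed argument, the stratified split and the demo data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preproc_no_leak(X, y, test_size=0.3, seed=0):
    """Split first, then scale with statistics from the training rows only.

    Hypothetical variant of preproc; the name and the seed argument are
    assumptions made for this sketch.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test

# Invented data: 200 rows, 3 features, 10% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = np.array([1] * 20 + [0] * 180)

X_tr, X_te, y_tr, y_te = preproc_no_leak(X_demo, y_demo)
print(X_tr.mean(axis=0).round(6), y_tr.mean(), y_te.mean())
```

Passing stratify=y keeps the share of positives roughly equal in both halves, which matters when one class is rare.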

Now let's run our models.

In [None]:
print('--------KMeans--------')
KMs(df)
print('--------Logistic Regression--------')
LogReg(df)
print('--------Random Forest--------')
randomforest(df)

Just as we would have expected, Random Forest came first, followed by Logistic Regression and KMeans. This is the same order as the model complexities: the ensemble of bagged trees in Random Forest, the log-likelihood maximisation of Logistic Regression, and the distance-based clustering of KMeans, which never even sees the labels.
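Since each function above calls preproc on its own, every model is scored on a different random split. Below is a sketch of a like-for-like comparison on one shared split. It uses synthetic data from make_classification in place of creditcard.csv, and the model settings (max_iter, n_estimators, random_state) are arbitrary choices for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented, imbalanced toy data (about 5% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)

# One split, fixed by random_state, shared by every model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f'{name}: out-of-sample accuracy = {scores[name]:.4f}')
```

With a shared split, any gap between the scores comes from the models and not from which rows happened to land in the test set.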

Still, 99.99% accuracy is quite outrageous, and I would like you to tell me whether I am doing this correctly!
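Before trusting an accuracy figure like this, it helps to check how rare the positive class is. According to the dataset description on Kaggle, only 492 of the 284,807 transactions are fraudulent (about 0.17%), so a model that never predicts fraud would already score roughly 99.8% accuracy. A minimal sketch on invented labels (not the real data) shows why recall and precision tell a different story:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: 10 frauds among 10,000 transactions.
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "model" that never flags fraud.
always_legit = np.zeros_like(y_true)

# A model that catches 6 of the 10 frauds and raises 4 false alarms.
some_model = np.zeros_like(y_true)
some_model[:6] = 1
some_model[100:104] = 1

for name, pred in [('always legit', always_legit), ('some model', some_model)]:
    acc = accuracy_score(y_true, pred)
    rec = recall_score(y_true, pred, zero_division=0)
    prec = precision_score(y_true, pred, zero_division=0)
    print(f'{name}: accuracy={acc:.4f} recall={rec:.2f} precision={prec:.2f}')
```

Both predictors score above 99.8% accuracy here, yet the first one catches no fraud at all. That is why recall and precision on the fraud class are usually reported for data this imbalanced.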

I have also included the relative importance of the PCA columns. Nothing interesting here other than that the importances are quite well dispersed, possibly an effect of PCA.

And there you go, thank you! Feedback appreciated!