In [1]:
import pandas as pd 

In [37]:
# reading in the data 
clean_transaction = pd.read_csv('clean_transaction.csv')

In [40]:
# code used to delete random column that gets written in from reading in csv, you can ignore this 
clean_transaction = clean_transaction.drop(clean_transaction.columns[0], axis=1)

In [41]:
clean_transaction.head()

Unnamed: 0,step,customer,age,gender,merchant,category,amount,fraud,customer_reliability,merchant_reliability,diff_previous_step,diff_previous_amount,mean_amount,diff_from_mean_amount
0,30,C1000148617,5,0,M1888755466,2,143.87,0,0.007634,0.25,,,35.091908,108.778092
1,38,C1000148617,5,0,M1741626453,7,16.69,0,0.007634,0.371212,8.0,-127.18,35.091908,-18.401908
2,42,C1000148617,5,0,M1888755466,2,56.18,0,0.007634,0.25,4.0,39.49,35.091908,21.088092
3,43,C1000148617,5,0,M840466850,6,14.74,0,0.007634,0.112938,1.0,-41.44,35.091908,-20.351908
4,44,C1000148617,5,0,M1823072687,0,47.42,0,0.007634,0.0,1.0,32.68,35.091908,12.328092


Remember the general conditions for the features (the x values):
- features cannot contain null values; either get rid of them, impute them, or create a new category for them  
- features must be numerical (or at least represented numerically)
- too many features can lead to overfitting or long run times, choose sparingly
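
A quick way to check the first two conditions is sketched below. The tiny `toy` frame and its values are made up for illustration; only the column names mirror the transaction data.

```python
import pandas as pd

# Hypothetical miniature of the transaction data (values are made up).
toy = pd.DataFrame({
    'age': [5, 3, 2],
    'amount': [143.87, 16.69, 56.18],
    'diff_previous_step': [None, 8.0, 4.0],
    'customer': ['C1', 'C1', 'C2'],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Columns whose dtype is not numeric and would need encoding before use as features.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

print(null_counts)
print(non_numeric)
```

Columns that show up in either result need attention (dropping, imputing, or encoding) before they can be used as features.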

In [42]:
# Basic way of ensuring that features satisfy general conditions 
clean_transaction = clean_transaction.dropna()

# Creating the Training/Validation/Testing Sets

The proportions that you set for your training/validation/testing sets can change the way your machine learning algorithm performs. What's the point of each set?

- __Training__: This set is used to train your algorithm. In almost all cases, it should be the majority of your data. Some important things to note: ensure that the training set contains every label you are trying to predict, check whether the labels are balanced, etc. Think logically: could your algorithm predict a label it has never been trained on? The model is only as good as its training. THE MODEL SHOULD ONLY EVER BE TRAINED ON THIS SET OF DATA! Don't make the mistake of training on validation or testing data.
- __Validation__: This set is used to help you choose your algorithm. Comparing algorithms by running each of them many times or on many different sets can take a lot of time and resources, so the validation set acts as a 'fake' testing set. Predict on the validation set with each candidate model; the model with the best validation results will tend to be the model with the best testing results. THIS SET OF DATA SHOULD ONLY BE USED FOR PREDICTIONS!
- __Testing__: This set is used to determine how good your algorithm actually is. It should hint at how well your algorithm will do on data it has never seen. THIS SET OF DATA SHOULD ONLY BE USED FOR PREDICTIONS!

Some important tips:

- As you manipulate your training/validation/testing sets, you need to _ensure_ that your labels and features __stay properly connected__. In other words, you don't want your features to accidentally end up with the wrong label; either keep the data connected by manipulating the dataframe first, by using tuples, or by using some other form of connected data organization.
- Ensure that the data in your sets __is actually good__. Sometimes data is organized by label (e.g. the first 100 data points are fraud, the next 100 are not fraud). Ensure that 1) your sets have a good variety of labels, 2) your sets are a reasonably good representation of the population you're trying to predict, and 3) you've dealt with issues such as imbalanced data.
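
One way to act on both tips, sketched under assumptions: shuffle whole rows so each label travels with its features, then compare the fraud rate of each slice. The `toy` frame is invented and `random_state=0` is an arbitrary seed.

```python
import pandas as pd

# Hypothetical data sorted by label: all non-fraud rows first, fraud rows last.
toy = pd.DataFrame({
    'amount': list(range(100)),
    'fraud': [0] * 90 + [1] * 10,
})

# Shuffle entire rows; features and labels stay on the same row.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Each amount still carries its original label (amounts >= 90 were the fraud rows).
still_linked = ((shuffled['amount'] >= 90) == (shuffled['fraud'] == 1)).all()

# Fraud rate in the first and second half, before and after shuffling.
print(toy.iloc[:50]['fraud'].mean(), toy.iloc[50:]['fraud'].mean())
print(shuffled.iloc[:50]['fraud'].mean(), shuffled.iloc[50:]['fraud'].mean())
```

In the unshuffled frame the first half contains no fraud at all, which is exactly the situation the second tip warns about.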


In [43]:
def train_validate_test(df,percent_train, percent_validate):
    '''
    Parameters: 
        df: the dataframe that you want to split up 
        percent_train: The percentage (in decimal form) that you want of df to be training set 
        percent_validate: The percentage (in decimal form) that you want of df to be validation set
    Returns: 
        if conditions are correct, returns df split into training, validate, and test (in that order)
    '''
    # assuming that you want the remainder of the data that isn't train/validate to be testing set
    # raise an error instead of returning None, which would fail later when unpacking the three sets
    if percent_train+percent_validate == 1: 
        raise ValueError('your percentages are incorrect, you have no data left to be testing set')
    elif percent_train+percent_validate > 1:
        raise ValueError('your percentages are incorrect, your percentages add up to more than 100%')
    else: 
        # what conditions might you want to set here? 
        # shuffle data: df = df.sample(frac=1).reset_index(drop=True)
        # ensure data is balanced by simulating training data to be 50% fraud and 50% not fraud 
        total_n = len(df)
        train_n = int(percent_train*total_n)
        validate_n = train_n+int(percent_validate*total_n)
        train_df = df.iloc[:train_n]
        validate_df = df.iloc[train_n:validate_n]
        test_df = df.iloc[validate_n:]
        return train_df,validate_df, test_df

In [44]:
train_transaction, validate_transaction, test_transaction = train_validate_test(clean_transaction,.5,.2)

In [45]:
# compare the length of the original dataframe and the new split up 
len(clean_transaction) == len(train_transaction)+len(validate_transaction)+len(test_transaction)

True

Please note that you can change the training, validation, and testing sets as much as you please. They don't have to be the first n% of the data, nor does each data point have to be unique (although keeping them unique is probably the smarter idea). That means you can go ahead and simulate a 'fake' training set of 50% fraudulent and 50% non-fraudulent data. You can do whatever you want as long as these rules aren't broken:
- training, validation, and the testing set must __all__ have the same number of features (or x values or columns)
- MAKE SURE FOR THE FUTURE THAT YOU ARE ONLY FITTING THE MODEL ON THE TRAINING SET AND NOTHING ELSE 
- once again, make sure your features (x values) and labels (y labels) are properly organized and correctly linked together 
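
As a sketch of the balanced 'fake' training set mentioned above, the hypothetical helper `balance_training_set` below downsamples the larger class. The function name, the `label_col` parameter, and the seed are assumptions for illustration. Apply it to the training split only, so validation and testing still reflect the real fraud rate.

```python
import pandas as pd

def balance_training_set(train_df, label_col='fraud', seed=0):
    '''Downsample the larger class so both classes have equally many rows.'''
    fraud_rows = train_df[train_df[label_col] == 1]
    not_fraud_rows = train_df[train_df[label_col] == 0]
    n = min(len(fraud_rows), len(not_fraud_rows))
    balanced = pd.concat([
        fraud_rows.sample(n=n, random_state=seed),
        not_fraud_rows.sample(n=n, random_state=seed),
    ])
    # Shuffle so the two classes are interleaved, then reset the index.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical training split: 95 legitimate rows and 5 fraud rows.
toy_train = pd.DataFrame({'amount': range(100), 'fraud': [0] * 95 + [1] * 5})
balanced_train = balance_training_set(toy_train)
print(balanced_train['fraud'].value_counts())
```

Downsampling throws data away; sampling the fraud rows with replacement is the other common option, at the cost of repeated data points.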

# Splitting the data into features and labels

To my knowledge, almost all machine learning algorithms expect array-like inputs. scikit-learn accepts lists, NumPy arrays, and DataFrames; here we convert everything into plain lists to keep things simple. The following is example code. Make sure you understand it, but there's a high chance you'll basically just copy and paste it (the code shouldn't change much, especially if you're starting from dataframes rather than lists).

In [55]:
def features_and_labels(df,list_interested_features, interested_label):
    '''
    Parameters:
        df: the dataframe that you're trying to split into features and labels 
        list_interested_features: a list of column names of the df that represent the features
        interested_label: ONE SINGULAR COLUMN NAME OF DF, the column that contains the item you're trying to predict
    Returns:
        returns two lists, features and labels (in that order)
    '''
    # ensure that the features you select are numbers 
    features = df[list_interested_features].values.tolist()
    labels = df[interested_label].values.tolist()
    return features, labels

In [70]:
train_X, train_y = features_and_labels(train_transaction,['age','gender','category','amount'],'fraud')
validate_X, validate_y = features_and_labels(validate_transaction,['age','gender','category','amount'],'fraud')
test_X, test_y = features_and_labels(test_transaction,['age','gender','category','amount'],'fraud')

Note that the features (X) for train, validate, and test are all the same (age, gender, category, amount). Also note that the values in each row are in the same order as the input list of features. For example,

In [59]:
train_X[0]

[5.0, 0.0, 7.0, 16.69]

The feature list passed to features_and_labels was [age, gender, category, amount], and train_X is a list of lists with one inner list per row.
For this first data point, 5 is the age, 0 is the gender, 7 is the category, and 16.69 is the amount.
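
Because each inner list follows the order of the feature names, pairing the two makes a row easier to read. A small sketch, with `feature_names` and `first_row` typed in by hand from the example above:

```python
# Feature names in the same order they were passed to features_and_labels.
feature_names = ['age', 'gender', 'category', 'amount']

# The first row of train_X shown above.
first_row = [5.0, 0.0, 7.0, 16.69]

# Pair each name with the value at the same position.
labelled_row = dict(zip(feature_names, first_row))
print(labelled_row)
```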

# Applying these features and labels into machine learning algorithms

In [61]:
# import machine learning algorithms here 
from sklearn.linear_model import LogisticRegression

The general outline for using your data in machine learning algorithms is as follows: 

model = Algorithm(parameters of algorithm) <br>
model.fit(trainX, trainY) <br>
predictions = model.predict(X of set that you want predictions of) <br>


You can modify model = Algorithm(parameters of algorithm) (by changing the algorithm or the parameters of the algorithm) and you can modify predictions = model.predict(X of set that you want predictions of) (by changing what you want predictions of). However, __you cannot modify model.fit(trainX, trainY).__ The model __must__ be fit on the training features and labels.
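
The outline can be sketched end to end on made-up data. The feature values and labels below are invented, and `solver='lbfgs'` is passed explicitly only to avoid version-dependent default warnings; treat it as an illustration rather than a tuned model.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature (amount), label 1 = fraud.
toy_train_X = [[10.0], [12.0], [15.0], [400.0], [450.0], [500.0]]
toy_train_y = [0, 0, 0, 1, 1, 1]

# Hypothetical rows we want predictions for.
toy_new_X = [[11.0], [480.0]]

model = LogisticRegression(solver='lbfgs')   # choose the algorithm and its parameters
model.fit(toy_train_X, toy_train_y)          # fit on the training set only
predictions = model.predict(toy_new_X)       # predict for any set of features

print(predictions)
```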

In [64]:
# example of logistic regression 
logistic_mod = LogisticRegression()

In [65]:
# DO NOT CHANGE THIS SYNTAX 
logistic_mod.fit(train_X,train_y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [63]:
training_predictions = logistic_mod.predict(train_X)
validation_predictions = logistic_mod.predict(validate_X)
testing_predictions = logistic_mod.predict(test_X)

Note that the input of predict is ALWAYS a features list (X). Also note that model.predict() returns a NumPy array of predictions with the same length as its input.

In [67]:
len(training_predictions) == len(train_y)

True

In [68]:
len(validation_predictions) == len(validate_y)

True

In [71]:
len(testing_predictions) == len(test_y)

True

Algorithm.predict() is simply outputting the predicted labels of the set that you inputted in the parameters.

# Determining how good the algorithm is 

There are multiple ways of measuring how good an algorithm is. While the standard measure is accuracy, high accuracy does __not__ necessarily mean the algorithm is good (especially in cases of imbalanced data). Make sure to check other measures such as BER (balanced error rate), precision, recall, etc.
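
For instance, precision and recall for the fraud class can be computed from counts of true positives, false positives, and false negatives. The helper name `precision_recall` and the toy lists are assumptions for illustration; it expects plain lists of 0/1 labels.

```python
def precision_recall(predictions, actual_labels, positive=1):
    '''Return (precision, recall) for the class given by `positive`.'''
    pairs = list(zip(predictions, actual_labels))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
    # Guard against division by zero when nothing is predicted or labelled positive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Hypothetical predictions and labels (1 = fraud).
toy_predictions = [1, 0, 0, 1, 0, 0]
toy_actual = [1, 1, 0, 0, 0, 0]
print(precision_recall(toy_predictions, toy_actual))
```

Precision answers "of the rows flagged as fraud, how many really were?", while recall answers "of the real fraud rows, how many were flagged?".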

In [76]:
def accuracy(predictions,actual_labels):
    # predictions is the NumPy array returned by model.predict
    predictions = predictions.tolist()
    # count the positions where the prediction matches the actual label
    total_correct = sum(p == a for p, a in zip(predictions, actual_labels))
    total_number = len(predictions)
    return total_correct/total_number

In [77]:
accuracy(training_predictions,train_y)

0.9945675918242934

In [78]:
accuracy(validation_predictions,validate_y)

0.9935735695053596

In [80]:
accuracy(testing_predictions,test_y)

0.9939546172951005

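Be WARY of high accuracy. This does NOT necessarily mean the model is good: when fraud is rare, a model that never predicts fraud still gets most rows right. Implement measures such as BER (balanced error rate).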
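
To see why high accuracy can mislead, consider a made-up set where 1% of transactions are fraud and a 'model' that always predicts not fraud. The numbers below are invented for illustration.

```python
# Hypothetical labels: 990 legitimate transactions and 10 fraudulent ones.
toy_actual = [0] * 990 + [1] * 10

# A useless baseline that never flags fraud.
toy_predictions = [0] * len(toy_actual)

# Plain accuracy: fraction of all predictions that match.
plain_accuracy = sum(p == a for p, a in zip(toy_predictions, toy_actual)) / len(toy_actual)

# Per-class rates: fraction of fraud caught and fraction of non-fraud passed.
fraud_caught = sum(p == 1 for p, a in zip(toy_predictions, toy_actual) if a == 1) / 10
not_fraud_passed = sum(p == 0 for p, a in zip(toy_predictions, toy_actual) if a == 0) / 990

# Balanced accuracy is the mean of the per-class rates; BER is 1 minus that.
balanced_accuracy = (fraud_caught + not_fraud_passed) / 2
balanced_error_rate = 1 - balanced_accuracy

print(plain_accuracy, balanced_accuracy, balanced_error_rate)
```

The baseline scores 99% plain accuracy while catching no fraud at all, and the balanced measure exposes that by landing at 0.5.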

In [104]:
def ber(predictions,actual_labels):
    # NOTE: this returns the mean of the per-class accuracies (balanced accuracy);
    # the balanced error rate itself is 1 minus this value, so higher is better here
    temp_df = pd.DataFrame({'Predictions':predictions.tolist(),'Actual':actual_labels})
    temp_df['Prediction same as Actual'] = temp_df['Predictions'] == temp_df['Actual']
    actual_not_fraud = temp_df[temp_df['Actual']==0]
    actual_fraud = temp_df[temp_df['Actual']==1]
    # fraction of actual fraud (and actual non-fraud) rows that were predicted correctly
    fraud_rate = sum(actual_fraud['Prediction same as Actual'])/len(actual_fraud)
    not_fraud_rate = sum(actual_not_fraud['Prediction same as Actual'])/len(actual_not_fraud)
    balanced = (fraud_rate+not_fraud_rate)/2
    return balanced

In [105]:
ber(training_predictions,train_y)

0.7878090115881129

In [106]:
ber(validation_predictions,validate_y)

0.7900887098603275

In [107]:
ber(testing_predictions,test_y)

0.7840530651716171

Feel free to implement other measures that capture different aspects of the model's performance.

# Final Words

In general, this process is the same for every algorithm. There are multiple ways of improving your algorithm, such as:

- improving or modifying your training set so that the model is properly fitted (by modify
- changing the model itself or the parameters 
- changing the accuracy measures to ensure you're actually selecting the correct model 
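
As one sketch of changing the model's parameters and selecting by a balance-aware measure, the loop below fits `LogisticRegression` with and without `class_weight='balanced'` and compares the two on a validation set using scikit-learn's `balanced_accuracy_score`. The data is randomly generated and the candidate settings are arbitrary assumptions.

```python
import random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

random.seed(0)

def make_rows(n):
    '''Hypothetical data: about 10% fraud, and fraud amounts tend to be larger.'''
    X, y = [], []
    for _ in range(n):
        is_fraud = 1 if random.random() < 0.1 else 0
        amount = random.gauss(300, 100) if is_fraud else random.gauss(50, 100)
        X.append([amount])
        y.append(is_fraud)
    return X, y

toy_train_X, toy_train_y = make_rows(1000)
toy_validate_X, toy_validate_y = make_rows(300)

candidates = {
    'default': LogisticRegression(solver='lbfgs'),
    'balanced': LogisticRegression(solver='lbfgs', class_weight='balanced'),
}

scores = {}
for name, model in candidates.items():
    model.fit(toy_train_X, toy_train_y)                # fit on training data only
    validation_predictions = model.predict(toy_validate_X)
    scores[name] = balanced_accuracy_score(toy_validate_y, validation_predictions)

# Pick the candidate with the best validation score.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

The testing set stays untouched until a winner has been chosen on the validation set.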