# Detecting anomalies in credit card transactions

In this project, we look for anomalies in credit card transactions. Flagging unusual behaviour by cardholders can help surface transactions that may be the result of fraud. The data comes in two CSV files. The first lists each credit card number, the place where the card was issued, and its credit limit. The second lists, for each transaction, the credit card number, the date and time, the transaction amount, and the location of the transaction as latitude and longitude.

I took the following steps:
1. Adding the credit limit information to the dataframe with transaction amount information
2. Calculating the percentage of credit limit consumed in a particular transaction
3. Deriving the latitudes and longitudes of places where the credit cards were issued and adding this information to the transactions dataframe
4. Calculating the distance between place of transaction and place of registration
5. Normalising the data and building a model having a sigmoid layer at the end
6. Fitting the model and making predictions

## Preliminary steps

In [1]:
#Import all packages required for this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from geopy.geocoders import Nominatim
import pgeocode
from geopy.extra.rate_limiter import RateLimiter
import geopy.distance
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import Normalizer, MinMaxScaler
from sklearn.pipeline import Pipeline




In [2]:
# Load source CSVs to dataframes
cc_details_df = pd.read_csv('cc_info.csv')
transaction_details_df = pd.read_csv('transactions.csv')

In [3]:
cc_details_df.head()

Unnamed: 0,credit_card,city,state,zipcode,credit_card_limit
0,1280981422329509,Dallas,PA,18612,6000
1,9737219864179988,Houston,PA,15342,16000
2,4749889059323202,Auburn,MA,1501,14000
3,9591503562024072,Orlando,WV,26412,18000
4,2095640259001271,New York,NY,10001,20000


In [4]:
transaction_details_df.head()

Unnamed: 0,credit_card,date,transaction_dollar_amount,Long,Lat
0,1003715054175576,2015-09-11 00:32:40,43.78,-80.174132,40.26737
1,1003715054175576,2015-10-24 22:23:08,103.15,-80.19424,40.180114
2,1003715054175576,2015-10-26 18:19:36,48.55,-80.211033,40.313004
3,1003715054175576,2015-10-22 19:41:10,136.18,-80.174138,40.290895
4,1003715054175576,2015-10-26 20:08:22,71.82,-80.23872,40.166719


In [5]:
transaction_details_df.dtypes

credit_card                    int64
date                          object
transaction_dollar_amount    float64
Long                         float64
Lat                          float64
dtype: object

In [6]:
cc_details_df.dtypes

credit_card           int64
city                 object
state                object
zipcode               int64
credit_card_limit     int64
dtype: object

In [7]:
transaction_details_df.shape

(294588, 5)

In [8]:
cc_details_df.shape

(984, 5)

In [9]:
# Creating copies so that the source dataframes are not affected by any transformations
trans_ = transaction_details_df.copy()
cc_ = cc_details_df.copy()

In [10]:
# Confirming that we have the data for credit cards that were used in the transactions
all_cards = cc_['credit_card']
trans_['credit_card'].isin(all_cards).unique()

array([ True])

## 1. Adding the credit limit information to the dataframe with transaction amount information

Here, we will first test a method to get the credit limit of a credit card used in a particular transaction. Then, we will use a for loop to repeat this process for all transactions. And finally, we will add this information to the trans_ dataframe.

In [11]:
# Output variable is a list
limit = []

# First we get the credit card number
credit_card_num_t = trans_['credit_card'].values[0]

# Now, we can match the credit card number with the credit card limit
for k in range(cc_.shape[0]):
    if credit_card_num_t == cc_['credit_card'].values[k]:
        limit_t = cc_['credit_card_limit'].values[k]
        limit.append(limit_t)

In [12]:
limit

[20000]

In [13]:
# Let's repeat this process for all transactions

# These are the credit card numbers for every transaction
trans_cred_values = trans_['credit_card'].values

# These are all the credit card numbers for which we have data
cc_values = cc_['credit_card'].values

#These are all the corresponding credit card limits
cc_limit_values = cc_['credit_card_limit'].values

limit_ = []
for j in range(trans_.shape[0]):
    # for every row in trans_
    # Take the credit card number
    credit_card_num_j = trans_cred_values[j]
    
    #Take the credit limit
    for k in range(cc_.shape[0]):
        if credit_card_num_j == cc_values[k]:
            limit_jk = cc_limit_values[k]
            limit_.append(limit_jk)

In [14]:
# Adding the credit limits to the trans_ dataframe
trans_['credit_limit'] = limit_
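
The nested loop above compares every transaction against every card, which is slow for roughly 295,000 transactions and 984 cards. As a minimal sketch of an alternative, the same lookup can be written as a left join with `DataFrame.merge`. The frames `cc_demo` and `trans_demo` are made-up stand-ins for `cc_` and `trans_`, and the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for cc_ and trans_ (values are illustrative only)
cc_demo = pd.DataFrame({
    "credit_card": [111, 222],
    "credit_card_limit": [6000, 20000],
})
trans_demo = pd.DataFrame({
    "credit_card": [222, 111, 222],
    "transaction_dollar_amount": [43.78, 10.00, 103.15],
})

# Keep only the key and the limit, renamed to match the column used in this notebook
limits = cc_demo[["credit_card", "credit_card_limit"]].rename(
    columns={"credit_card_limit": "credit_limit"}
)

# A left join keeps every transaction in its original order and pulls in the matching limit.
# validate="many_to_one" raises an error if a card number appears twice in `limits`.
merged = trans_demo.merge(limits, on="credit_card", how="left", validate="many_to_one")

print(merged["credit_limit"].tolist())
```

A card that is missing from the lookup table would get `NaN` here instead of silently shortening the list, which makes mismatches easier to spot.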

## 2. Calculating the percentage of credit limit consumed in a particular transaction

In [15]:
trans_['perc_of_credit_limit'] = trans_['transaction_dollar_amount']/trans_['credit_limit']*100

In [16]:
trans_['perc_of_credit_limit'].describe()

count    294588.000000
mean          0.660865
std           1.310663
min           0.000029
25%           0.233966
50%           0.411217
75%           0.670922
max          49.660000
Name: perc_of_credit_limit, dtype: float64

## 3. Deriving the latitudes and longitudes of places where the credit cards were issued and adding this information to the transactions dataframe

In this part, we will take the following steps:
1. Create a column in the cc_ dataframe where we will store the addresses
2. Extract all unique addresses into a new dataframe, 'ua_df'
3. Run a geocoder to get the latitudes and longitudes of all unique addresses
4. Add these coordinates to the cc_ dataframe just like we added credit limits from cc_ to trans_
5. Create a new dataframe, 'empty_coord_df', out of cc_ such that it contains all credit cards that were assigned NaN coordinates
6. Run a geocoder to get the latitudes and longitudes of all zipcodes in the dataframe with NaN coordinates
7. Drop rows for which coordinates are still NaN values
8. Create a new dataframe, 'old_coord_df', out of cc_ that contains all credit cards that did not have NaN coordinates
9. Merge 'old_coord_df' and 'empty_coord_df' to create new_cc dataframe
10. Add the credit card coordinates from new_cc to trans_ just like we added credit limits from cc_ to trans_

In [17]:
# Since all the states are represented by their postal codes, we will need a dictionary to convert them all back to their names
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}
    
# invert the dictionary
abbrev_to_us_state = dict(map(reversed, us_state_to_abbrev.items()))

In [18]:
# Here we are creating a list for all addresses in cc_
# Lists of zipcodes, cities, and state postal codes
zips = cc_['zipcode'].values
cities = cc_['city'].values
state_code = cc_['state'].values

#List of addresses
city_state_ = []


for j in range(cc_.shape[0]):
    city_j = str(cities[j])
    # Converting postal state code to state name
    state_j = str(abbrev_to_us_state[state_code[j]])
    zips_j = str(zips[j])
    address_j = city_j + ', ' + state_j + ' - ' + zips_j +'.'
    city_state_.append(address_j)
    

In [19]:
# Adding the list we created to the cc_
cc_['Address'] = city_state_

In [20]:
# Creating the Nominatim object and setting a delay. You can choose to keep the delay at 1 second as well. The code will still function the same way
geolocator_08 = Nominatim(user_agent='Digi_98')
geocode = RateLimiter(geolocator_08.geocode, min_delay_seconds=2)

In [21]:
# Creating a dataframe with all unique addresses
unique_address = set(city_state_)

ua_df = pd.DataFrame(unique_address, columns=['Address'])

In [22]:
# Getting the location objects for all unique addresses; these contain coordinates and various other identifiers
# Using the rate-limited geocode wrapper defined above so that requests are spaced out
ua_df['location'] = ua_df['Address'].apply(geocode)

In [23]:
# Getting the latitudes and longitudes
ua_df['latitude'] = ua_df['location'].apply(lambda x: x.latitude if x is not None else None)
ua_df['longitude'] = ua_df['location'].apply(lambda x: x.longitude if x is not None else None)

In [24]:
ua_df

Unnamed: 0,Address,location,latitude,longitude
0,"Jackson, New Hampshire - 3846.","(Jackson, Carroll County, New Hampshire, 03846...",44.144276,-71.181107
1,"Seattle, Washington - 98060.",,,
2,"Knoxville, Pennsylvania - 16928.","(Knoxville, Tioga County, Pennsylvania, 16928,...",41.957293,-77.438872
3,"New York, New York - 10001.","(City of New York, New York, United States, (4...",40.712728,-74.006015
4,"Indianapolis, Indiana - 46201.","(Indianapolis, Marion County, Indiana, United ...",39.768333,-86.158350
...,...,...,...,...
119,"Charleston, Maine - 4422.","(Charleston, Penobscot County, Maine, 04422, U...",45.085062,-69.040595
120,"Lafayette, New Jersey - 7848.",,,
121,"Chicago, Illinois - 60290.","(Chicago, Cook County, Illinois, United States...",41.875562,-87.624421
122,"Greensboro, Vermont - 5841.","(5841, Vermont Route 16, Wheelock, Caledonia C...",44.611439,-72.207900


In [25]:
# Adding all coordinates to cc_

# These are the addresses for every credit card
cc_address_values = cc_['Address'].values

# These are all the unique addresses for which we have data
ua_adress_values = ua_df['Address'].values

#These are all the corresponding latitudes
ua_lat_values = ua_df['latitude'].values

#These are all the corresponding longitudes
ua_lon_values = ua_df['longitude'].values

lat_ = []
lon_ = []

for j in range(cc_.shape[0]):
    # for every row in cc_
    # Take the address
    address_j = cc_address_values[j]
    
    #Take the latitude and longitude
    for k in range(ua_df.shape[0]):
        if address_j == ua_adress_values[k]:
            lat_jk = ua_lat_values[k]
            lon_jk = ua_lon_values[k]
            lat_.append(lat_jk)
            lon_.append(lon_jk)

In [26]:
lat_

[41.33617,
 40.2464593,
 42.1945465,
 38.8712074,
 40.7127281,
 42.5542347,
 40.4416941,
 nan,
 43.1763512,
 34.0536909,
 43.2309198,
 40.2464593,
 43.1763512,
 nan,
 43.1763512,
 43.1763512,
 nan,
 43.1763512,
 nan,
 40.2464593,
 41.5394353,
 43.1763512,
 43.1763512,
 40.7392018,
 40.2464593,
 43.1763512,
 43.1763512,
 43.1763512,
 43.1763512,
 27.7635302,
 42.3723379,
 40.6345309,
 42.4153739,
 43.240451,
 42.4153739,
 nan,
 27.7639145,
 42.118675,
 43.1763512,
 43.1763512,
 40.7392018,
 41.8755616,
 nan,
 40.2464593,
 34.0536909,
 43.1763512,
 41.33617,
 43.1763512,
 43.1763512,
 40.2464593,
 40.4416941,
 44.79020125863546,
 40.2464593,
 40.4416941,
 43.1763512,
 39.100105,
 43.1763512,
 42.2125871,
 35.4729886,
 43.1763512,
 43.1763512,
 35.5939325,
 37.7567819,
 39.1361859,
 43.1763512,
 43.0828444,
 43.1763512,
 43.1763512,
 42.3723379,
 nan,
 38.1523268,
 43.0828444,
 40.4416941,
 43.1763512,
 nan,
 42.4153739,
 43.1763512,
 40.7127281,
 42.8867166,
 43.1763512,
 43.1763512,
 40

In [27]:
lon_

[-75.9632636,
 -80.2114472,
 -71.8358095,
 -80.5937044,
 -74.0060152,
 -77.4724875,
 -79.9900861,
 nan,
 -72.0969498,
 -118.242766,
 -76.3001887,
 -80.2114472,
 -72.0969498,
 nan,
 -72.0969498,
 -72.0969498,
 nan,
 -72.0969498,
 nan,
 -80.2114472,
 -85.5394484,
 -72.0969498,
 -72.0969498,
 -89.0164626,
 -80.2114472,
 -72.0969498,
 -72.0969498,
 -72.0969498,
 -72.0969498,
 -97.4033191,
 -73.3678063,
 -76.5888514,
 -71.1564428,
 -75.883942,
 -71.1564428,
 nan,
 -98.2388953,
 -72.546951,
 -72.0969498,
 -72.0969498,
 -89.0164626,
 -87.6244212,
 nan,
 -80.2114472,
 -118.242766,
 -72.0969498,
 -75.9632636,
 -72.0969498,
 -72.0969498,
 -80.2114472,
 -79.9900861,
 -72.2964057457816,
 -80.2114472,
 -79.9900861,
 -72.0969498,
 -94.5781416,
 -72.0969498,
 -74.5693201,
 -97.5170536,
 -72.0969498,
 -72.0969498,
 -105.223896,
 -81.1742659,
 -76.5490844,
 -72.0969498,
 -76.3771554,
 -72.0969498,
 -72.0969498,
 -73.3678063,
 nan,
 -81.447892,
 -76.3771554,
 -79.9900861,
 -72.0969498,
 nan,
 -71.156442

In [28]:
cc_['latitude'] = lat_
cc_['longitude'] = lon_

In [29]:
cc_.head()

Unnamed: 0,credit_card,city,state,zipcode,credit_card_limit,Address,latitude,longitude
0,1280981422329509,Dallas,PA,18612,6000,"Dallas, Pennsylvania - 18612.",41.33617,-75.963264
1,9737219864179988,Houston,PA,15342,16000,"Houston, Pennsylvania - 15342.",40.246459,-80.211447
2,4749889059323202,Auburn,MA,1501,14000,"Auburn, Massachusetts - 1501.",42.194547,-71.835809
3,9591503562024072,Orlando,WV,26412,18000,"Orlando, West Virginia - 26412.",38.871207,-80.593704
4,2095640259001271,New York,NY,10001,20000,"New York, New York - 10001.",40.712728,-74.006015


In [30]:
# Creating a dataframe that contains only empty coordinate credit cards

empty_coord_df = cc_[cc_['longitude'].isna()]

no_lon_cc = empty_coord_df['credit_card'].values

In [31]:
empty_coord_df.head()

Unnamed: 0,credit_card,city,state,zipcode,credit_card_limit,Address,latitude,longitude
7,7482288151831611,Birmingham,NJ,8011,4000,"Birmingham, New Jersey - 8011.",,
13,8536914250563809,Colorado Springs,CO,80509,14000,"Colorado Springs, Colorado - 80509.",,
16,2216132730528773,San Francisco,CA,94101,18000,"San Francisco, California - 94101.",,
18,9548629685194612,Columbus,NJ,8022,7000,"Columbus, New Jersey - 8022.",,
35,2925559987432581,Tacoma,WA,98401,10000,"Tacoma, Washington - 98401.",,


In [32]:
# Resetting the index and discarding the old one; reassigning avoids in-place changes on a slice of cc_
empty_coord_df = empty_coord_df.reset_index(drop=True)


In [34]:
# Creating a pgeocode object
nomi = pgeocode.Nominatim('us')

In [35]:
# Getting all coordinates

mt_zips = empty_coord_df['zipcode'].values
mt_lat_cc = []
mt_long_cc = []

for j in range(empty_coord_df.shape[0]):
    zip_j = str(mt_zips[j])
    query_j = nomi.query_postal_code(zip_j)
    lat_j = query_j['latitude']
    long_j = query_j['longitude']
    mt_lat_cc.append(lat_j)
    mt_long_cc.append(long_j)

In [36]:
# Adding coordinates to dataframe
# assign returns a new dataframe, which avoids setting values on a slice of cc_

empty_coord_df = empty_coord_df.assign(latitude=mt_lat_cc, longitude=mt_long_cc)


In [37]:
empty_coord_df.head()

Unnamed: 0,credit_card,city,state,zipcode,credit_card_limit,Address,latitude,longitude
0,7482288151831611,Birmingham,NJ,8011,4000,"Birmingham, New Jersey - 8011.",,
1,8536914250563809,Colorado Springs,CO,80509,14000,"Colorado Springs, Colorado - 80509.",,
2,2216132730528773,San Francisco,CA,94101,18000,"San Francisco, California - 94101.",,
3,9548629685194612,Columbus,NJ,8022,7000,"Columbus, New Jersey - 8022.",,
4,2925559987432581,Tacoma,WA,98401,10000,"Tacoma, Washington - 98401.",47.2537,-122.4443


In [38]:
empty_coord_df['latitude'].isna().sum()

48
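
One possible contributor to the remaining NaN values: `zipcode` is stored as `int64`, so ZIP codes that begin with a zero lose it (the New Jersey rows above show `8011` and `8022`). A four-digit string may not match a five-digit postal code in the lookup. The sketch below shows how the leading zeros could be restored before querying. `zip_demo` is a made-up sample, and whether this fixes any particular row in this dataset is an assumption that would need to be checked.

```python
import pandas as pd

# Integers, as in cc_['zipcode']; the first two values have lost a leading zero
zip_demo = pd.Series([8011, 3846, 98401])

# Cast to string and left-pad with zeros to the five digits of a US ZIP code
zip_str = zip_demo.astype(str).str.zfill(5)

print(zip_str.tolist())
```

ZIP codes that already have five digits are left unchanged, so the padding is safe to apply to the whole column.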

In [39]:
# Creating a copy of empty_coord_df to be on the safer side

mt_df_2 = empty_coord_df.copy()

In [40]:
# Since we have a copy of empty_coord_df as mt_df_2, we can drop rows with empty coords from empty_coord_df
empty_coord_df = empty_coord_df[empty_coord_df['latitude'].notna()]

In [41]:
# Resetting the index and discarding the old one; reassigning avoids in-place changes on a slice
empty_coord_df = empty_coord_df.reset_index(drop=True)


In [42]:
empty_coord_df.head()

Unnamed: 0,credit_card,city,state,zipcode,credit_card_limit,Address,latitude,longitude
0,2925559987432581,Tacoma,WA,98401,10000,"Tacoma, Washington - 98401.",47.2537,-122.4443
1,8534199181434464,Des Moines,IA,50301,20000,"Des Moines, Iowa - 50301.",41.6727,-93.5722
2,3801374660832282,Fort Worth,TX,76101,20000,"Fort Worth, Texas - 76101.",32.7714,-97.2915
3,3253141560871065,Cincinnati,OH,45201,30000,"Cincinnati, Ohio - 45201.",39.1668,-84.5382
4,7011626867998686,Cincinnati,OH,45201,7000,"Cincinnati, Ohio - 45201.",39.1668,-84.5382


In [43]:
# This dataframe will have all credit cards for which we had coordinates from the first attempt

old_coord_df = cc_[cc_['latitude'].notna()]

In [44]:
# Creating a new dataframe by merging old_coord_df and empty_coord_df
new_cc = pd.concat([old_coord_df,empty_coord_df])

In [45]:
# Creating lists of latitudes and longitudes of registration places of credit cards used in transactions

new_cc_list = new_cc['credit_card'].values

new_cc_lat = new_cc['latitude'].values
new_cc_lon = new_cc['longitude'].values

added_lat = []
added_lon = []
trans_cc_list = trans_['credit_card'].values

for j in range(trans_.shape[0]):
    #for every row in trans_
    # We take the credit card number
    trans_cc_j = trans_cc_list[j]
    # Then we check if the credit card number is in new_cc_list
    if trans_cc_j in new_cc_list:
        # We will now match the latitudes and longitudes
        for k in range(new_cc.shape[0]):
            if trans_cc_j == new_cc_list[k]:
                lat_jk = new_cc_lat[k]
                lon_jk = new_cc_lon[k]
                added_lat.append(lat_jk)
                added_lon.append(lon_jk)
    else:
        # If the credit card number is not in the list, append NaN; these rows are dropped later
        added_lat.append(np.nan)
        added_lon.append(np.nan)

In [46]:
# Adding the information to the trans_ dataframe
trans_['credit_card_lat'] = added_lat
trans_['credit_card_lon'] = added_lon
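
As a sketch of a shorter way to do the same lookup, `Series.map` can take a Series indexed by card number and returns `NaN` for any card it does not find, which matches the `else` branch of the loop above. `new_cc_demo` and `trans_demo` are made-up stand-ins for `new_cc` and `trans_`, and the approach assumes each card number appears only once in the lookup table.

```python
import pandas as pd

# Toy stand-in for new_cc: cards that have registration coordinates
new_cc_demo = pd.DataFrame({
    "credit_card": [111, 222],
    "latitude": [40.25, 47.25],
    "longitude": [-80.21, -122.44],
})

# Toy stand-in for trans_: card 333 has no coordinates on record
trans_demo = pd.DataFrame({"credit_card": [222, 333, 111]})

# Index the lookup table by card number so that map can match on it
coords = new_cc_demo.set_index("credit_card")

# Cards missing from the lookup table come back as NaN
trans_demo["credit_card_lat"] = trans_demo["credit_card"].map(coords["latitude"])
trans_demo["credit_card_lon"] = trans_demo["credit_card"].map(coords["longitude"])

print(trans_demo)
```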

In [47]:
trans_[trans_['credit_card_lat'].isna()]

Unnamed: 0,credit_card,date,transaction_dollar_amount,Long,Lat,credit_limit,perc_of_credit_limit,credit_card_lat,credit_card_lon
1187,1087468642191606,2015-08-04 15:56:40,83.40,-121.792036,47.485052,3000,2.780000,,
1188,1087468642191606,2015-09-03 19:04:22,53.93,-121.857773,47.373674,3000,1.797667,,
1189,1087468642191606,2015-08-15 23:32:06,27.09,-121.874610,47.444409,3000,0.903000,,
1190,1087468642191606,2015-09-10 22:48:11,170.83,-121.812252,47.511548,3000,5.694333,,
1191,1087468642191606,2015-08-07 19:14:08,75.21,-121.832278,47.434281,3000,2.507000,,
...,...,...,...,...,...,...,...,...,...
290911,9836548369808504,2015-09-21 01:51:11,81.76,-121.827220,47.421394,20000,0.408800,,
290912,9836548369808504,2015-09-30 18:48:32,52.15,-121.852459,47.495084,20000,0.260750,,
290913,9836548369808504,2015-09-07 16:37:57,109.44,-121.852677,47.407030,20000,0.547200,,
290914,9836548369808504,2015-10-15 17:31:42,70.08,-121.839381,47.424876,20000,0.350400,,


In [48]:
# Creating a dataframe that does not have NaN values for coordinates of registration places
dist_imp_trans_ = trans_[trans_['credit_card_lat'].notna()]

## 4. Calculating the distance between place of transaction and place of registration

In [49]:
#This is the credit card registration coordinate
test_cc_coord = (dist_imp_trans_['credit_card_lat'][0], dist_imp_trans_['credit_card_lon'][0])

#This is the transaction coordinate
test_trans_coord= (dist_imp_trans_['Lat'][0], dist_imp_trans_['Long'][0])

#This is the distance between the two
test_distance = geopy.distance.geodesic(test_cc_coord, test_trans_coord).km

In [50]:
test_distance

3.933043641442364

In [51]:
# Now we will repeat the process for all transactions

# Here we are just collecting all required values

dit_cc_lat = dist_imp_trans_['credit_card_lat'].values
dit_cc_lon = dist_imp_trans_['credit_card_lon'].values

dit_t_lat = dist_imp_trans_['Lat'].values
dit_t_lon = dist_imp_trans_['Long'].values

In [52]:
# Applying the geopy.distance.geodesic query

distances = []
for j in range(dist_imp_trans_.shape[0]):
    cc_coord_j = (dit_cc_lat[j], dit_cc_lon[j])
    trans_coord_j = (dit_t_lat[j], dit_t_lon[j])
    distance_j = geopy.distance.geodesic(cc_coord_j, trans_coord_j).km
    distances.append(distance_j)
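
As a rough cross-check on the geodesic distances, the haversine formula gives the great-circle distance on a sphere. This is a different, simpler model than the ellipsoidal one used by `geopy.distance.geodesic`, so the two will differ slightly. The function name `haversine_km` and the radius constant are choices made for this sketch. The coordinates are the registration and transaction points of the first row.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of the spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Registration point and transaction point of the first transaction
d = haversine_km(40.2464593, -80.2114472, 40.26737, -80.174132)
print(round(d, 2))
```

The result should land close to the geodesic value of about 3.93 km computed above. The same formula can be applied to whole columns at once by swapping the `math` functions for their NumPy equivalents.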

In [53]:
# Adding the information gathered to the dataframe
# assign returns a new dataframe, which avoids setting values on a slice of trans_

dist_imp_trans_ = dist_imp_trans_.assign(Distance_from_registration_place=distances)


In [54]:
dist_imp_trans_['Distance_from_registration_place'].describe()

count    280000.000000
mean        173.652183
std        1285.129625
min           0.025687
25%           4.484742
50%           6.402246
75%           8.172051
max       19743.725955
Name: Distance_from_registration_place, dtype: float64

In [55]:
dist_imp_trans_.head()

Unnamed: 0,credit_card,date,transaction_dollar_amount,Long,Lat,credit_limit,perc_of_credit_limit,credit_card_lat,credit_card_lon,Distance_from_registration_place
0,1003715054175576,2015-09-11 00:32:40,43.78,-80.174132,40.26737,20000,0.2189,40.246459,-80.211447,3.933044
1,1003715054175576,2015-10-24 22:23:08,103.15,-80.19424,40.180114,20000,0.51575,40.246459,-80.211447,7.511141
2,1003715054175576,2015-10-26 18:19:36,48.55,-80.211033,40.313004,20000,0.24275,40.246459,-80.211447,7.389253
3,1003715054175576,2015-10-22 19:41:10,136.18,-80.174138,40.290895,20000,0.6809,40.246459,-80.211447,5.866592
4,1003715054175576,2015-10-26 20:08:22,71.82,-80.23872,40.166719,20000,0.3591,40.246459,-80.211447,9.153604


## 5. Normalising the data and building a model having a sigmoid layer at the end

Here, we first convert the date and time column into epoch timestamps (seconds since 1 January 1970). Then, we use scikit-learn to normalise the data.

In [56]:
dist_imp_trans_.dtypes

credit_card                           int64
date                                 object
transaction_dollar_amount           float64
Long                                float64
Lat                                 float64
credit_limit                          int64
perc_of_credit_limit                float64
credit_card_lat                     float64
credit_card_lon                     float64
Distance_from_registration_place    float64
dtype: object

In [57]:
# Parsing the date strings; assign returns a new dataframe instead of modifying a slice
dist_imp_trans_ = dist_imp_trans_.assign(date=pd.to_datetime(dist_imp_trans_['date']))


In [58]:
# Converting each timestamp to seconds since the Unix epoch
dist_imp_trans_ = dist_imp_trans_.assign(date=dist_imp_trans_['date'].apply(lambda x: x.timestamp()))
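
The conversion to epoch seconds can also be done for the whole column at once by subtracting the epoch and taking the total seconds. This is only a sketch on two dates taken from the first rows of the transactions table; `dates` is a stand-in for the parsed `date` column, and the timestamps are treated as UTC because they carry no timezone.

```python
import pandas as pd

# Two dates from the first rows of the transactions table
dates = pd.Series(pd.to_datetime(["2015-09-11 00:32:40", "2015-10-24 22:23:08"]))

# Seconds elapsed since the Unix epoch, computed for the whole column at once
epoch_seconds = (dates - pd.Timestamp("1970-01-01")).dt.total_seconds()

print(epoch_seconds.tolist())
```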


In [59]:
pipeline = Pipeline([('normalizer', Normalizer()),
                     ('scaler', MinMaxScaler())])

In [60]:
X = dist_imp_trans_.to_numpy()
X_transform = pipeline.fit_transform(X)
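
To make the two pipeline steps concrete, here is a NumPy sketch of what they compute with their default settings: `Normalizer` rescales each row to unit L2 norm, and `MinMaxScaler` then rescales each column to the range [0, 1]. `X_demo` is made-up data whose first column mimics a 16-digit card number. It is there to illustrate one thing worth checking, namely that a very large column dominates the row norm.

```python
import numpy as np

# Three toy rows; the first column mimics a 16-digit card number
X_demo = np.array([
    [1.0e15, 43.78, 3.9],
    [1.0e15, 103.15, 7.5],
    [9.0e15, 48.55, 7.4],
])

# Step 1: row-wise L2 normalisation (what Normalizer does by default)
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
X_unit = X_demo / row_norms

# Step 2: column-wise min-max scaling to [0, 1] (what MinMaxScaler does by default)
col_min = X_unit.min(axis=0)
col_range = X_unit.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
X_scaled = (X_unit - col_min) / col_range

# After step 1 the first column is about 1 in every row: the card number dominates the norm
print(X_unit[:, 0])
print(X_scaled)
```

If that effect is not intended, it may be worth leaving identifier columns such as `credit_card` out of `X` before scaling. That is a suggestion, not something the notebook above does.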

## 6. Fitting the model and making predictions
First, we split the data into training and test sets. Then, we create a TensorFlow model, compile it, and fit it on the training data. Finally, we make predictions on both the test data and the training data.

In [61]:
X_train, X_test = train_test_split(X_transform, test_size=0.33, random_state=42)

In [62]:
BATCH_SIZE = 10000
EPOCHS = 200

In [63]:
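# Despite the variable name, the last layer has a single sigmoid unit, so the model
# outputs one score between 0 and 1 per row rather than a reconstruction of the inputs
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(10),
    tf.keras.layers.Dense(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The loss is the mean squared error between that score and the scaled input features
autoencoder.compile(optimizer="adam",
                    loss="mse",
                    metrics=["acc"])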





In [64]:
history = autoencoder.fit(X_train,
                          X_train, 
                          shuffle=True, 
                          epochs=EPOCHS, 
                          batch_size=BATCH_SIZE,
                          validation_data=(X_test, X_test))

Epoch 1/200
...
Epoch 200/200


In [65]:
# Making predictions on test data
reconstructions = autoencoder.predict(X_test)



In [66]:
# Making a dataframe out of predictions
fraud_or_no = pd.DataFrame(reconstructions, columns=['Prediction'])

In [67]:
fraud_or_no.reset_index(inplace=True)
fraud_or_no["index"] = fraud_or_no["index"] + 1

In [68]:
# The final layer is a sigmoid, so each output lies between 0 and 1; outputs of 0.5 or more are flagged as anomalies
nl_2 = []

for j in range(reconstructions.shape[0]):
    # predict returns an array of shape (n, 1), so we take the single value in each row
    value = float(reconstructions[j][0])
    if value >= 0.5:
        nl_2.append(1)
    else:
        nl_2.append(0)

In [69]:
# Enumerating the anomalies

fraud_or_no['Approx'] = nl_2
fraud_or_no.groupby(['Approx']).count()

Approx,index,Prediction
0,92382,92382
1,18,18
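
For comparison, a common way to score anomalies with an autoencoder is by per-row reconstruction error rather than by a single sigmoid output. This is a different technique from the model above, whose final layer has only one unit. The sketch uses random stand-in data: `X_demo` plays the role of the scaled features, `X_hat` plays the role of reconstructions from a hypothetical model whose output has the same width as its input, and the 99th-percentile cut-off is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 1000 rows of 5 scaled features and a hypothetical model's reconstructions
X_demo = rng.random((1000, 5))
X_hat = X_demo + rng.normal(0.0, 0.01, size=X_demo.shape)
X_hat[:10] += 0.5  # make the first ten rows reconstruct badly on purpose

# Per-row mean squared reconstruction error
errors = np.mean((X_demo - X_hat) ** 2, axis=1)

# Flag rows above a chosen percentile of the error distribution (99th here, an arbitrary choice)
threshold = np.percentile(errors, 99)
is_anomaly = errors > threshold

print(int(is_anomaly.sum()), np.flatnonzero(is_anomaly))
```

With a percentile cut-off, the number of flagged rows is fixed by the chosen percentile, so the threshold is best treated as a tuning parameter to be reviewed against known cases.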


In [70]:
# Making predictions on training data

recons_2 = autoencoder.predict(X_train)



In [71]:
# Making a dataframe out of prediction
fon_2 = pd.DataFrame(recons_2, columns=['Prediction'])

fon_2.reset_index(inplace=True)

fon_2['index'] = fon_2['index'] + 1

In [72]:
# The final layer is a sigmoid, so each output lies between 0 and 1; outputs of 0.5 or more are flagged as anomalies

new_list = []

for j in range(recons_2.shape[0]):
    # predict returns an array of shape (n, 1), so we take the single value in each row
    value = float(recons_2[j][0])
    if value >= 0.5:
        new_list.append(1)
    else:
        new_list.append(0)

In [73]:
# Enumerating the anomalies

fon_2['Approx'] = new_list

fon_2.groupby(['Approx']).count()

Approx,index,Prediction
0,187572,187572
1,28,28
