In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)

# Anonymisation

Here I anonymise the synthetic data from our model.


## TOC
 * [Preliminary](#Preliminary)
 * [Transaction Details](#Transaction-Details)
   * [Amount](#Amount)
   * [Time](#Time)
 * [Customer Details](#Customer-Details)
   * [Credit Card Number](#Credit-Card-Number)
   * [Person Details](#Person-Details)
   * [Other](#Other)
 * [Merchant Details](#Merchant-Details)
 * [Location Details](#Location-Details)

In [2]:
# Load the data
training_data=pd.read_csv("data/synthetic_train.csv",index_col=0)
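# NOTE: this reads the training file a second time, so `full_data` holds every row twice; point it at the test split if that was the intent
test_data=pd.read_csv("data/synthetic_train.csv",index_col=0)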
full_data=pd.concat([training_data, test_data])

sample_df=full_data.sample(n=10)


In [3]:
print(full_data.shape)
full_data.dtypes

(2593350, 22)


trans_date_trans_time     object
cc_num                     int64
merchant                  object
category                  object
amt                      float64
first                     object
last                      object
gender                    object
street                    object
city                      object
state                     object
zip                        int64
lat                      float64
long                     float64
city_pop                   int64
job                       object
dob                       object
trans_num                 object
unix_time                  int64
merch_lat                float64
merch_long               float64
is_fraud                   int64
dtype: object

In [4]:
# Sample of the data
sample_df.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
721095,2019-11-03 23:26:30,346273234529002,fraud_Hickle Group,shopping_pos,175.82,Donna,Moreno,F,32301 Albert River Suite 364,Ronceverte,WV,24970,37.7418,-80.4626,4575,Statistician,1991-10-22,17e359569e95c253b6583eb716bd8d5d,1351985190,38.212003,-81.2468,0
171532,2019-03-31 13:01:05,30118423745458,fraud_Abernathy and Sons,food_dining,29.87,Jared,Velazquez,M,01479 Murray Circle,Matawan,NJ,7747,40.4109,-74.238,30770,Drilling engineer,1993-04-29,e25b340bc6022b1b6acc09e46a0e66bd,1333198865,40.94722,-73.509474,0
28936,2019-01-18 00:12:26,377113842678100,fraud_Brekke and Sons,gas_transport,76.66,Billy,Gallagher,M,673 Delgado Burg,Greenwich,NJ,8323,39.4055,-75.3209,804,Insurance risk surveyor,1965-03-25,213a7fb1f3c39ed78704878b4f495ae8,1326845546,39.46803,-75.732957,0
568509,2019-08-30 04:50:49,3525590521269779,"fraud_Raynor, Feest and Miller",gas_transport,73.2,Scott,Fuller,M,861 Karen Common,Haw River,NC,27258,36.0424,-79.3242,6006,Paramedic,1984-07-20,58c2a806f45270cbef844e0d49e212fc,1346302249,35.64075,-78.62268,0
538936,2019-08-19 04:31:18,3538520143479972,fraud_Bashirian Group,shopping_net,2.16,Cassandra,Nunez,F,9572 Austin Forge Suite 612,Clay Center,OH,43408,41.5686,-83.3632,269,Insurance underwriter,1965-09-15,76c7f3cd2265d0daf41006104cf02136,1345350678,41.334376,-84.060964,0


## Preliminary

Convert `trans_date_trans_time` and `dob` into `datetime` objects.

In [5]:
full_data["trans_date_trans_time"]=pd.to_datetime(full_data["trans_date_trans_time"],format="%Y-%m-%d %H:%M:%S")
full_data["dob"]=pd.to_datetime(full_data["dob"],format="%Y-%m-%d")

sample_df["trans_date_trans_time"]=pd.to_datetime(sample_df["trans_date_trans_time"],format="%Y-%m-%d %H:%M:%S")
sample_df["dob"]=pd.to_datetime(sample_df["dob"],format="%Y-%m-%d")

**Cleaned Data**

I define the dataframe `clean_df`, which will hold all the prepared data.

In [6]:
clean_df=pd.DataFrame()
clean_df["is_fraud"]=sample_df["is_fraud"]

## Transaction Details
 * `trans_date_trans_time`
 * `amt`
 * `trans_num`
 * `unix_time`
 * `is_fraud`
 
Leave `is_fraud` untouched as it is the target variable the model will be fitted to.

In [7]:
# !pip install forex-python

In [8]:
import forex_python.converter as fx

### Amount

All `amt` values are in dollars, but the real data we have is from Europe so we need to account for the exchange rate. I implement the function `convert_currency` to perform this exchange and then create two columns `amount_USD` (a renaming of `amt`) and `amount_GBP` (value of `amt` in sterling).

In [9]:
def convert_currency(amount:float,unix_time:int,cur_currency:str,tar_currency:str) -> float:
    """
    Determine the value of an amount of one currency in another currency at a specified point in time
    
    PARAMS
    amount (float) - amount of current currency
    unix_time (int) - unix timestamp of exchange rate to use
    cur_currency (str) - three character code for current currency
    tar_currency (str) - three character code for target currency
    
    RETURNS
    float - amount of target currency
    """
    time=datetime.utcfromtimestamp(unix_time)
    exchange_rate=fx.get_rate(cur_currency,tar_currency,time)
    
    return round(amount*exchange_rate,2)

In [10]:
# Convert amount between currencies
def prepare_amount(df,cur_label,cur_currency="USD",tar_currency="GBP") -> pd.Series:
    # one exchange-rate lookup per row, which is slow on large dataframes
    return df[[cur_label,"unix_time"]].apply(lambda x: convert_currency(x[cur_label],x["unix_time"],cur_currency,tar_currency),axis=1)

In [11]:
def quicker(df,cur_label,cur_currency="USD",tar_currency="GBP") -> pd.Series:
    # calendar day (midnight) of each transaction
    dates=df["trans_date_trans_time"].dt.normalize()
    
    # determine the exchange rate for each day: one lookup per unique date rather than one per row
    # (convert_currency rounds its result, so each rate is rounded to 2 decimal places)
    exchange_rates={date:convert_currency(1,int(date.timestamp()),cur_currency,tar_currency) for date in pd.DatetimeIndex(dates.unique())}
    
    # calculate exchanged amounts; the result keeps the index of `df` so it aligns on assignment
    tar_label="amount_{}".format(tar_currency)
    return (df[cur_label]*dates.map(exchange_rates)).rename(tar_label)

In [12]:
clean_df["amount_USD"]=sample_df["amt"].copy()
clean_df["amount_GBP"]=quicker(sample_df[["trans_date_trans_time","amt"]],"amt","USD","GBP")
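
The per-day lookup can be tried out without calling the exchange-rate service. This is only a minimal sketch: `daily_rates` is a hypothetical hard-coded table standing in for `fx.get_rate`, and the rates in it are made up for illustration.

```python
import pandas as pd

# Hypothetical USD->GBP rates per calendar day (made-up values, not market data)
daily_rates = {
    pd.Timestamp("2019-01-18"): 0.77,
    pd.Timestamp("2019-08-30"): 0.82,
}

transactions = pd.DataFrame(
    {
        "trans_date_trans_time": pd.to_datetime(
            ["2019-01-18 00:12:26", "2019-08-30 04:50:49", "2019-01-18 09:00:00"]
        ),
        "amt": [100.0, 50.0, 10.0],
    },
    index=[28936, 568509, 42],  # non-default index, as in a sampled dataframe
)

# Truncate each timestamp to midnight, then look up that day's rate
days = transactions["trans_date_trans_time"].dt.normalize()
amount_gbp = (transactions["amt"] * days.map(daily_rates)).round(2)

print(amount_gbp)
```

Because the rates are mapped onto the existing rows rather than merged in, the result keeps the original index and can be assigned straight back to a dataframe built from the same rows.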

### Time

In the real data set, time is given in seconds since the first transaction. I implement the function `standardise_time`, which calculates the number of seconds between midnight on the day of the first transaction and each transaction.

The full-time columns (`trans_date_trans_time` and `unix_time`) are still useful as you can use them to determine the year, month, day etc. of the transaction and thus may want to consider how to incorporate these into feature selection. 
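
As a rough illustration of the calendar features mentioned above, the sketch below derives a few of them with the pandas `.dt` accessor. The column names (`hour`, `day_of_week`, `month`, `is_weekend`) are my own choice for this example, not part of the prepared dataset.

```python
import pandas as pd

timestamps = pd.Series(
    pd.to_datetime(
        ["2019-11-03 23:26:30", "2019-03-31 13:01:05", "2019-01-18 00:12:26"]
    )
)

calendar_features = pd.DataFrame(
    {
        "hour": timestamps.dt.hour,
        "day_of_week": timestamps.dt.dayofweek,  # Monday=0 ... Sunday=6
        "month": timestamps.dt.month,
        "is_weekend": timestamps.dt.dayofweek >= 5,
    }
)

print(calendar_features)
```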

In [13]:
def standardise_time(series) -> pd.Series:
    min_time=series.min().to_pydatetime()
    min_day=min_time.replace(second=0,minute=0,hour=0)
    return ((series-min_day).dt.total_seconds()).astype(int)

In [14]:
clean_df["unix_time"]=sample_df["unix_time"].copy()
clean_df["seconds_from_start"]=standardise_time(sample_df["trans_date_trans_time"])
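
A small worked example of the same calculation, using a hypothetical helper `seconds_from_first_midnight` and two made-up timestamps, may make the reference point clearer:

```python
import pandas as pd

def seconds_from_first_midnight(series: pd.Series) -> pd.Series:
    # Midnight at the start of the day of the earliest timestamp in the series
    start = series.min().normalize()
    return (series - start).dt.total_seconds().astype(int)

times = pd.Series(pd.to_datetime(["2019-01-06 08:00:00", "2019-01-18 00:12:26"]))
offsets = seconds_from_first_midnight(times)

print(offsets.tolist())  # [28800, 1037546]
```

The reference point is the earliest timestamp in whatever series is passed in, so a sample and the full dataset will generally produce different offsets for the same transaction.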

In [15]:
clean_df.head(5)

Unnamed: 0,is_fraud,amount_USD,amount_GBP,unix_time,seconds_from_start
721095,0,175.82,135.3814,1351985190,26090790
171532,0,29.87,22.7012,1333198865,7304465
28936,0,76.66,59.0282,1326845546,1037546
568509,0,73.2,60.024,1346302249,20407849
538936,0,2.16,1.7712,1345350678,19456278


## Customer Details

**Account Details**
 * `cc_num`
 
**Person Details**
 * `first`
 * `last`
 * `gender`
 * Address (`street`, `city`, `state`, `zip`, `lat`, `long`)
  * *Owen is working on this as we want to maintain relationships between locations*. (See [below](#Location-Details))
 * `job`
 * `dob`
 
**Other**
 * `city_pop`

### Credit Card Number

Anonymise the credit card number by replacing each distinct value with an integer code.

In [16]:
def anonymise(series:pd.Series) -> pd.Series:
    return series.astype("category").cat.codes
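
To show what the category codes look like, here is a minimal sketch on three made-up rows. It also shows `pd.factorize` as one possible alternative; that is a suggestion of mine, not something the notebook uses.

```python
import pandas as pd

card_numbers = pd.Series([377113842678100, 30118423745458, 377113842678100])

# Category codes number the distinct values in sorted order
codes = card_numbers.astype("category").cat.codes
print(codes.tolist())  # [1, 0, 1]

# pd.factorize numbers values by first appearance instead of by sorted order
first_seen_codes, uniques = pd.factorize(card_numbers)
print(first_seen_codes.tolist())  # [0, 1, 0]
```

In both cases the codes depend on which values are present, so ids computed on a sample will not match ids computed on the full dataset.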

In [17]:
clean_df["cc_id"]=anonymise(sample_df["cc_num"])
clean_df.head(1)

Unnamed: 0,is_fraud,amount_USD,amount_GBP,unix_time,seconds_from_start,cc_id
721095,0,175.82,135.3814,1351985190,26090790,3


### Person Details

`first` and `last` contain the customer's name, so they need to be anonymised. I introduce the column `person_id`, which contains a unique identifier for each person in the dataset (it is assumed that each unique combination of `first`, `last`, `job` and `dob` is a unique person).

In [18]:
clean_df["person_id"]=anonymise(sample_df["first"]+"_"+sample_df["last"]+"_"+sample_df["job"]+"_"+sample_df["dob"].apply(lambda x: x.strftime('%Y-%m-%d')))
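
An alternative to concatenating the columns into one string is to number the groups directly. The sketch below assumes the same four identifying columns and uses made-up rows. Grouping on the columns themselves avoids any ambiguity from the `_` separator appearing inside a value.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "first": ["Donna", "Jared", "Donna"],
        "last": ["Moreno", "Velazquez", "Moreno"],
        "job": ["Statistician", "Drilling engineer", "Statistician"],
        "dob": pd.to_datetime(["1991-10-22", "1993-04-29", "1991-10-22"]),
    }
)

# One integer per distinct (first, last, job, dob) combination
person_id = people.groupby(["first", "last", "job", "dob"], sort=False).ngroup()

print(person_id.tolist())  # [0, 1, 0]
```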

Keeping `gender` seems reasonable as it may be useful in the model fit. May be worth further investigation into whether there are notable differences in fraud rates/patterns between the genders.

In [19]:
clean_df["gender_id"]=anonymise(sample_df["gender"])

`job` could be useful for the model and does not necessarily need anonymising. I will leave it as is and likely try to group jobs during feature selection to reduce degrees of freedom.

In [30]:
clean_df["job_category"]=sample_df["job"]

`dob` should be useful for the model. I am going to discretise it into the age of the customer (in whole years) at the time of the transaction.

In [21]:
def dob_to_age(df) -> pd.Series:
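    # whole years elapsed, taking a year as 365.2425 days (the length numpy uses for its "Y" unit)
    return (df["trans_date_trans_time"]-df["dob"])//pd.Timedelta(days=365.2425)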

In [22]:
clean_df["age"]=dob_to_age(sample_df[["dob","trans_date_trans_time"]])
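
Dividing by an average year length can be off by one close to a birthday. If exact ages matter, one option is to compare calendar dates directly. This is a sketch only, and `age_in_years` is a hypothetical helper rather than part of the notebook.

```python
import pandas as pd

def age_in_years(dob: pd.Series, when: pd.Series) -> pd.Series:
    # Difference in calendar years, minus one if the birthday has not yet occurred that year
    before_birthday = (when.dt.month < dob.dt.month) | (
        (when.dt.month == dob.dt.month) & (when.dt.day < dob.dt.day)
    )
    return when.dt.year - dob.dt.year - before_birthday.astype(int)

dob = pd.Series(pd.to_datetime(["1991-10-22", "1993-04-29"]))
when = pd.Series(pd.to_datetime(["2019-11-03 23:26:30", "2019-03-31 13:01:05"]))
ages = age_in_years(dob, when)

print(ages.tolist())  # [28, 25]
```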

Intuitively, `city_pop` seems likely to be useful in fitting the model, so we want to keep it. However, because the value is so precise, it could be used to identify the city the customer is from. I therefore round it up to the nearest 1,000 and store the result in thousands (e.g. 4,575 becomes 5).

During feature selection, it is likely we would want to discretise/categorise `city_pop` values. This will introduce our prior knowledge and reduce the degrees of freedom which the model needs to consider.

In [23]:
clean_df["city_pop_round"]=np.ceil(sample_df["city_pop"]/1000).copy().astype(int)
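
The later discretisation could look something like the sketch below. The band edges and labels are hypothetical placeholders chosen for illustration, not values agreed for the model.

```python
import numpy as np
import pandas as pd

city_pop = pd.Series([4575, 30770, 804, 6006, 269])

# Round up to the next thousand, expressed in thousands (as in `city_pop_round`)
city_pop_thousands = np.ceil(city_pop / 1000).astype(int)
print(city_pop_thousands.tolist())  # [5, 31, 1, 7, 1]

# Hypothetical size bands (right-inclusive): (0, 1], (1, 10], (10, 100], (100, inf)
bands = pd.cut(
    city_pop_thousands,
    bins=[0, 1, 10, 100, np.inf],
    labels=["tiny", "small", "medium", "large"],
)
print(bands.tolist())  # ['small', 'medium', 'tiny', 'small', 'tiny']
```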

In [31]:
clean_df.head(5)

Unnamed: 0,is_fraud,amount_USD,amount_GBP,unix_time,seconds_from_start,cc_id,person_id,gender_id,job_id,age,city_pop_round,merchant_id,merchant_category,job_category
721095,0,175.82,135.3814,1351985190,26090790,3,3,0,8,28,5,6,shopping_pos,Statistician
171532,0,29.87,22.7012,1333198865,7304465,1,5,1,2,25,31,0,food_dining,Drilling engineer
28936,0,76.66,59.0282,1326845546,1037546,4,0,1,4,53,1,3,gas_transport,Insurance risk surveyor
568509,0,73.2,60.024,1346302249,20407849,5,8,1,6,35,7,8,gas_transport,Paramedic
538936,0,2.16,1.7712,1345350678,19456278,6,2,0,5,53,1,1,shopping_net,Insurance underwriter


## Merchant Details
 * `merchant`
 * `category`
 * Location (`merch_lat`,`merch_long`)
   * *Owen is working on this as we want to maintain relationships between locations*. (See [below](#Location-Details))

`merchant` is the name of the merchant and should be anonymised. This is done by simply converting it into an id.

In [25]:
clean_df["merchant_id"]=anonymise(sample_df["merchant"])

`category` is the verbose category of the merchant. This does not necessarily need to be anonymised as it is not identifiable, but some pre-processing should be performed in order to reduce the degrees of freedom of the model.

In [26]:
clean_df["merchant_category"]=sample_df["category"]
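
One possible pre-processing step is to separate the type of merchant from the sales channel. The sketch below rests on an assumption I have not verified against the data dictionary: that a trailing `_pos` or `_net` marks the channel (point of sale versus online) rather than the kind of merchant.

```python
import pandas as pd

categories = pd.Series(["shopping_pos", "shopping_net", "food_dining", "gas_transport"])

# Assumption: a trailing "_pos" or "_net" is a channel marker, everything before it is the group
parts = categories.str.extract(r"^(?P<group>.+?)(?:_(?P<channel>pos|net))?$")

print(parts["group"].tolist())    # ['shopping', 'shopping', 'food_dining', 'gas_transport']
print(parts["channel"].tolist())  # channel is missing where no suffix is present
```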

In [27]:
clean_df.head(5)

Unnamed: 0,is_fraud,amount_USD,amount_GBP,unix_time,seconds_from_start,cc_id,person_id,gender_id,job_id,age,city_pop_round,merchant_id,merchant_category
721095,0,175.82,135.3814,1351985190,26090790,3,3,0,8,28,5,6,shopping_pos
171532,0,29.87,22.7012,1333198865,7304465,1,5,1,2,25,31,0,food_dining
28936,0,76.66,59.0282,1326845546,1037546,4,0,1,4,53,1,3,gas_transport
568509,0,73.2,60.024,1346302249,20407849,5,8,1,6,35,7,8,gas_transport
538936,0,2.16,1.7712,1345350678,19456278,6,2,0,5,53,1,1,shopping_net


## Location Details

There is location data for customers (`street`, `city`, `state`, `zip`, `lat`, `long`) and for merchants (`merch_lat`,`merch_long`).

*Owen is looking at this*

## Save File

In [28]:
def save_data(df:pd.DataFrame,file_path):
    df.to_csv(file_path)

In [29]:
# save_data(clean_df,"data/prepared_synthetic_data.csv")