## Bonus Task: Data aggregation
1. Check whether it is a foreign transaction (issuer country differs from shopper country).

2. Aggregate several features per card:
    * Average amount of previous transactions
    * Average number of previous transactions per day
    * Number of transactions today before the current one
    * Total amount today before the current transaction
    * Number of previous transactions that used the same currency as this transaction
    

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline



### Please note:
Computing these features takes a long time, so the results were saved to a CSV file. You don't need to run the code below.

In [3]:
# Check whether it is a foreign transaction
cleaned_df['foreign_transaction'] = cleaned_df['issuercountrycode'] != cleaned_df['shoppercountrycode']

In [3]:
# Sort the Dataframe 
cleaned_df = cleaned_df.sort_values('creation_date')

pre_avg = []
prev_daily_num = []
tdy_num = []
tdy_sum = []
cry_prev_count = []

for i in range(len(cleaned_df)):
    row = cleaned_df.iloc[i]
    # All transactions that happened before the current one
    previous_data = cleaned_df[0:i]
    # Keep only those made with the same card
    previous_data = previous_data[previous_data.card_id==row.card_id]
    
    if len(previous_data) == 0:
        previous_average = 0
        previous_daily_num = 0
        currency_previous_count = 0
    else:
        previous_average = np.mean(previous_data.amount)
        previous_daily_num =  np.mean(previous_data['amount'].groupby(previous_data.creation_day).count())
        currency_previous_count = sum(previous_data.currencycode==row.currencycode)
    # Match today's transaction
    today_transaction = previous_data[previous_data.creation_day==row.creation_day]
    if len(today_transaction)==0:
        today_num = 0
        today_sum = 0
        
    else:
        today_num = len(today_transaction)
        today_sum = sum(today_transaction.amount)
    
    # Append the result in list accordingly
    pre_avg.append(previous_average)
    prev_daily_num.append(previous_daily_num)
    tdy_num.append(today_num)
    tdy_sum.append(today_sum)
    cry_prev_count.append(currency_previous_count)
    
    if i%2000 == 0:
        print('finished %d' % i)

# Add these columns as new features
cleaned_df['previous_average'] = pre_avg
cleaned_df['previous_daily_num'] = prev_daily_num
cleaned_df['today_num'] = tdy_num
cleaned_df['today_sum'] = tdy_sum
cleaned_df['currency_prev_count'] = cry_prev_count
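
The loop above filters all earlier rows again for every transaction, which is why it takes so long. As a minimal sketch, the same five features could probably be computed with grouped cumulative operations. It assumes the same column names (`card_id`, `creation_date`, `creation_day`, `amount`, `currencycode`) and no missing values in the grouping columns. The function name `add_history_features` and the toy data are hypothetical, and the sketch has not been run against the real data set.

```python
import pandas as pd

def add_history_features(df):
    """Per-card history features via cumulative group operations (sketch)."""
    # Stable sort so that ties on creation_date are ordered deterministically
    df = df.sort_values('creation_date', kind='mergesort').reset_index(drop=True)
    by_card = df.groupby('card_id')
    by_card_day = df.groupby(['card_id', 'creation_day'])
    by_card_cur = df.groupby(['card_id', 'currencycode'])

    # Number and total amount of earlier transactions on the same card
    prev_count = by_card.cumcount()
    prev_sum = by_card['amount'].cumsum() - df['amount']
    # Same quantities restricted to the current day
    today_num = by_card_day.cumcount()
    today_sum = by_card_day['amount'].cumsum() - df['amount']
    # Earlier transactions on the same card in the same currency
    cur_count = by_card_cur.cumcount()

    # Distinct days seen among earlier transactions of the card
    first_of_day = (today_num == 0).astype(int)
    prev_days = first_of_day.groupby(df['card_id']).cumsum() - first_of_day

    # Divide only where there is history; rows without history get 0
    df['previous_average'] = (prev_sum / prev_count.where(prev_count > 0)).fillna(0)
    df['previous_daily_num'] = (prev_count / prev_days.where(prev_days > 0)).fillna(0)
    df['today_num'] = today_num
    df['today_sum'] = today_sum
    df['currency_prev_count'] = cur_count
    return df

# Hypothetical toy data: three transactions on card A, one on card B
toy = pd.DataFrame({
    'card_id':       ['A', 'A', 'B', 'A'],
    'creation_date': pd.to_datetime(['2015-07-01 09:00', '2015-07-01 12:00',
                                     '2015-07-01 13:00', '2015-07-02 08:00']),
    'creation_day':  [1, 1, 1, 2],
    'amount':        [10, 20, 5, 30],
    'currencycode':  ['EUR', 'EUR', 'EUR', 'USD'],
})
features = add_history_features(toy)
print(features[['card_id', 'previous_average', 'previous_daily_num',
                'today_num', 'today_sum', 'currency_prev_count']])
```

Each cumulative count or sum includes the current row, so subtracting the current amount (or using `cumcount`, which starts at 0) restricts every feature to strictly earlier transactions, as in the loop.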


### You start here!
Here we load 'aggregated_data', a pre-saved file that contains the features computed above.
You can download the file here: https://drive.google.com/file/d/0B5YbrNDkPK3nTEFMZkJiQXdFRUU/view?usp=sharing

In [2]:
cleaned_df = pd.read_csv('aggregated_data')

In [3]:
# Again, transform numerical variables
from sklearn.preprocessing import StandardScaler
cleaned_df['previous_average'] = StandardScaler().fit_transform(cleaned_df['previous_average'].values.reshape(-1, 1))
cleaned_df['today_sum'] = StandardScaler().fit_transform(cleaned_df['today_sum'].values.reshape(-1, 1))

In [4]:
# Drop the features we don't need
data_aggregated = cleaned_df.drop(['Unnamed: 0','txid','bookingdate','amount','simple_journal','converted_amount','creation_date','creationdate','card_id','ip_id','mail_id'],axis=1)

In [5]:
def dense_encoding(data,column,threshold):
    # Replace categories that occur at most `threshold` times with 'others'
    count = dict(data[column].value_counts())
    mapping = {}
    for value in count.keys():
        if count[value]>threshold:
            mapping[value] = value
        else:
            mapping[value] = 'others'
    data[column] = data[column].map(mapping)
    return data

data_aggregated = dense_encoding(data_aggregated,'bin',2)
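
To illustrate what this rare-category collapsing does, here is a small sketch on toy data. The helper `collapse_rare` is hypothetical and not part of the notebook. It is meant to behave like `dense_encoding` for a single column, except that it returns a new Series instead of modifying the frame in place.

```python
import pandas as pd

def collapse_rare(series, threshold):
    """Replace values that occur at most `threshold` times with 'others'."""
    counts = series.value_counts()              # missing values are not counted
    frequent = counts[counts > threshold].index
    # Keep frequent values and missing values; everything else becomes 'others'
    return series.where(series.isin(frequent) | series.isna(), 'others')

# Hypothetical toy column: 'a' occurs 3 times, 'b' twice, 'c' once
bins = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', None])
collapsed = collapse_rare(bins, 2)
print(collapsed.tolist())
```

With a threshold of 2, only `'a'` is kept, so the one-hot encoding later creates a single shared `others` column instead of one column per rare value.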

In [6]:
# Encode for ordinal feature: creation day
date_mapping = {label:idx for idx,label in enumerate(data_aggregated['creation_day'].unique())}
data_aggregated['creation_day'] = data_aggregated['creation_day'].map(date_mapping)
# Encode for categorical variables
columns = list(data_aggregated.columns)
columns.remove('norm_amount')
columns.remove('norm_converted_amount')
columns.remove('label')
columns.remove('creation_day')
columns.remove('creation_month')
columns.remove('currency_prev_count')
columns.remove('today_sum')
columns.remove('today_num')
columns.remove('previous_daily_num')
columns.remove('previous_average')
columns.remove('foreign_transaction')

# OneHot Encoding
encoded_data = pd.get_dummies(data_aggregated,columns=columns,dummy_na=True)

# Training data X and label y (.ix was removed in pandas 1.0)
X = encoded_data.drop('label', axis=1)
y = encoded_data['label']

In [7]:
X = X.values
y = y.values

In [8]:
from sklearn.utils import shuffle
X, y = shuffle(X,y)

### Build two models: random forest and logistic regression
1. Build the two models
2. Use 10-fold cross-validation to evaluate them
3. Compare with the previous (non-aggregated) results

In [9]:
from utilities import ten_fold_CV_eval
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
forest = RandomForestClassifier(n_estimators=250, n_jobs=4)
# The L1 penalty needs a solver that supports it, such as liblinear
lr = LogisticRegression(C=400, penalty='l1', solver='liblinear')

In [10]:
avg_score, avg_acc, avg_precsion, avg_recall, avg_f1,avg_fhalf = ten_fold_CV_eval(forest,X,y, r= 0.01,t = 0.4)

Start 0 fold:
Finished SMOTE!
23661 9 32 3
Start 1 fold:
Finished SMOTE!
23661 8 34 1
Start 2 fold:
Finished SMOTE!
23661 8 28 7
Start 3 fold:
Finished SMOTE!
23663 6 30 5
Start 4 fold:
Finished SMOTE!
23663 6 29 6
Start 5 fold:
Finished SMOTE!
23662 7 29 5
Start 6 fold:
Finished SMOTE!
23662 7 32 2
Start 7 fold:
Finished SMOTE!
23664 5 32 2
Start 8 fold:
Finished SMOTE!
23661 8 31 3
Start 9 fold:
Finished SMOTE!
23665 4 31 3


In [17]:
# Print the results of the aggregated and non-aggregated models for random forest.
print ("average auc score of aggregated model is:%0.3f,    average auc score of non-aggregated model is:%0.3f" % (avg_score,0.890))
print ("average accuracy of aggregated model is:%0.3f,     average accuracy of non-aggregated model is:%0.3f" % (avg_acc,0.999))
print ("average precision of aggregated model is:%0.3f,    average precision of non-aggregated model is:%0.3f" % (avg_precsion,0.434))
print ("average recall of aggregated model is:%0.3f,       average recall of non-aggregated model is:%0.3f" % (avg_recall,0.107))
print ("average F1 score of aggregated model is:%0.3f,     average F1 score of non-aggregated model is:%0.3f" % (avg_f1,0.168))
print ("average F0.5 score of aggregated model is:%0.3f,   average F0.5 score of non-aggregated model is:%0.3f" % (avg_fhalf,0.261))

average auc score of aggregated model is:0.878,    average auc score of non-aggregated model is:0.890
average accuracy of aggregated model is:0.998,     average accuracy of non-aggregated model is:0.999
average precision of aggregated model is:0.341,    average precision of non-aggregated model is:0.434
average recall of aggregated model is:0.107,       average recall of non-aggregated model is:0.107
average F1 score of aggregated model is:0.161,     average F1 score of non-aggregated model is:0.168
average F0.5 score of aggregated model is:0.234,   average F0.5 score of non-aggregated model is:0.261
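
The four numbers printed for each fold are not labelled in the output. As a sketch, assuming they are the confusion-matrix counts in the order TN, FP, FN, TP (an assumption, since `ten_fold_CV_eval` is not shown here), the averaged random forest metrics can be recomputed from them. The helper `fold_metrics` is hypothetical.

```python
# Per-fold counts copied from the random forest output above.
# Assumed order (not stated in the notebook): TN, FP, FN, TP.
folds = [
    (23661, 9, 32, 3), (23661, 8, 34, 1), (23661, 8, 28, 7),
    (23663, 6, 30, 5), (23663, 6, 29, 6), (23662, 7, 29, 5),
    (23662, 7, 32, 2), (23664, 5, 32, 2), (23661, 8, 31, 3),
    (23665, 4, 31, 3),
]

def fold_metrics(tn, fp, fn, tp, beta=0.5):
    """Precision, recall, F1 and F-beta for one fold."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f1, fbeta

# Compute the metrics per fold, then average each metric over the ten folds
per_fold = [fold_metrics(*counts) for counts in folds]
averages = [sum(column) / len(per_fold) for column in zip(*per_fold)]
print("precision=%0.3f recall=%0.3f F1=%0.3f F0.5=%0.3f" % tuple(averages))
# prints: precision=0.341 recall=0.107 F1=0.161 F0.5=0.234
```

Under this reading the recomputed values match the reported averages, which suggests the metrics are calculated per fold and then averaged rather than computed from pooled counts.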


In [11]:
avg_score1, avg_acc1, avg_precsion1, avg_recall1, avg_f11,avg_fhalf1 = ten_fold_CV_eval(lr,X,y, r = 0.01)

Start 0 fold:
Finished SMOTE!
23647 23 27 8
Start 1 fold:
Finished SMOTE!
23641 28 30 5
Start 2 fold:
Finished SMOTE!
23635 34 29 6
Start 3 fold:
Finished SMOTE!
23627 42 31 4
Start 4 fold:
Finished SMOTE!
23632 37 29 6
Start 5 fold:
Finished SMOTE!
23637 32 28 6
Start 6 fold:
Finished SMOTE!
23642 27 30 4
Start 7 fold:
Finished SMOTE!
23640 29 27 7
Start 8 fold:
Finished SMOTE!
23644 25 27 7
Start 9 fold:
Finished SMOTE!
23638 31 28 6


In [19]:
avg_score1, avg_acc1, avg_precsion1, avg_recall1, avg_f11,avg_fhalf1

(0.91471219365233092,
 0.99749405217538778,
 0.16483546746174724,
 0.1710924369747899,
 0.16729735382704175,
 0.16567571789337185)

In [21]:
# Print the results of the aggregated and non-aggregated models for logistic regression.
print ("average auc score of aggregated model is:%0.3f,    average auc score of non-aggregated model is:%0.3f" % (avg_score1,0.913))
print ("average accuracy of aggregated model is:%0.3f,     average accuracy of non-aggregated model is:%0.3f" % (avg_acc1,0.998))
print ("average precision of aggregated model is:%0.3f,    average precision of non-aggregated model is:%0.3f" % (avg_precsion1,0.176))
print ("average recall of aggregated model is:%0.3f,       average recall of non-aggregated model is:%0.3f" % (avg_recall1,0.153))
print ("average F1 score of aggregated model is:%0.3f,     average F1 score of non-aggregated model is:%0.3f" % (avg_f11,0.161))
print ("average F0.5 score of aggregated model is:%0.3f,   average F0.5 score of non-aggregated model is:%0.3f" % (avg_fhalf1,0.169))

average auc score of aggregated model is:0.915,    average auc score of non-aggregated model is:0.913
average accuracy of aggregated model is:0.997,     average accuracy of non-aggregated model is:0.998
average precision of aggregated model is:0.165,    average precision of non-aggregated model is:0.176
average recall of aggregated model is:0.171,       average recall of non-aggregated model is:0.153
average F1 score of aggregated model is:0.167,     average F1 score of non-aggregated model is:0.161
average F0.5 score of aggregated model is:0.166,   average F0.5 score of non-aggregated model is:0.169


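### Discussion
The comparison above shows no substantial difference between the aggregated data and the original data.
For the random forest, recall is unchanged (0.107), while AUC (0.878 vs. 0.890) and precision (0.341 vs. 0.434) are somewhat lower with the aggregated features.
For the logistic regression, AUC (0.915 vs. 0.913), recall (0.171 vs. 0.153) and F1 (0.167 vs. 0.161) are slightly better than for the previous model, while precision (0.165 vs. 0.176) and F0.5 (0.166 vs. 0.169) are slightly lower.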