# Revolut Financial Crime Challenge
## Home Task

# TASK 1 - Communication and SQL familiarity

#### Examine the following SQL query, and explain clearly and succinctly what it means. Will the query work? Explain why or why not. (15 points)

```SQL
WITH processed_users AS (
SELECT left(u.phone_country, 2) AS short_phone_country, u.id 
FROM users u
)
SELECT t.user_id, 
t.merchant_country, 
sum(t.amount / fx.rate / power(10, cd.exponent)) AS amount 
FROM transactions t
JOIN fx_rates fx ON (fx.ccy = t.currency AND fx.base_ccy = 'EUR')
JOIN currency_details cd ON cd.currency = t.currency
JOIN processed_users pu ON pu.id = t.user_id
WHERE t.source = 'GAIA'
AND pu.short_phone_country = t.merchant_country
GROUP BY t.user_id, t.merchant_country

ORDER BY amount DESC;
```

<img src="Screenshot%202019-03-13%20at%2000.07.06.png" width="800" />

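**Examine the following SQL query, and explain clearly and succinctly what it means:**

For transactions whose source is 'GAIA', the query sums the transaction amounts per user and merchant country, converted with the EUR-based rate and scaled from minor units by the currency exponent. It keeps only transactions whose merchant country equals the first two characters of the user's phone country, and sorts the totals in descending order.

**Will the query work? Explain why or why not.**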
___

The query does not return the intended result because of this line: **AND pu.short_phone_country = t.merchant_country**. The compared values are in different formats, so the condition never holds and the result is empty.

> pu.short_phone_country  -> **varchar(2)**, ex. HU

> t.merchant_country -> **varchar(3)**, ex. HUN


The fix is to align the merchant country code with the phone country code by truncating the string:

> Instead of **AND pu.short_phone_country = t.merchant_country**, the condition should be **AND pu.short_phone_country = left(t."MERCHANT_COUNTRY",2)**
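
As a small illustration of the format mismatch, the Python sketch below compares two-letter and three-letter country codes directly and after truncation. The code pairs are standard ISO 3166-1 codes; the helper names are hypothetical and are not part of the task data.

```python
# A few ISO 3166-1 pairs: alpha-2 (as in the phone country) -> alpha-3 (as in the merchant country).
ALPHA2_TO_ALPHA3 = {
    "HU": "HUN", "GB": "GBR", "FR": "FRA",
    "DK": "DNK", "PL": "POL", "IE": "IRL",
}

def matches_directly(alpha2, alpha3):
    """Comparison used by the original query: a 2-character vs a 3-character string."""
    return alpha2 == alpha3

def matches_after_truncation(alpha2, alpha3):
    """Comparison used by the fix: keep only the first two characters of the alpha-3 code."""
    return alpha2 == alpha3[:2]

# No pair matches when the strings have different lengths.
direct_hits = [a2 for a2, a3 in ALPHA2_TO_ALPHA3.items() if matches_directly(a2, a3)]

# Truncation matches most pairs, but not all of them.
truncated_misses = [
    a2 for a2, a3 in ALPHA2_TO_ALPHA3.items() if not matches_after_truncation(a2, a3)
]
```

Truncation is therefore an approximation: codes such as DNK, POL and IRL do not start with their two-letter counterpart, so a lookup table from three-letter to two-letter codes would be more reliable if one is available.
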

***
Additionally, the exchange-rate calculation is wrong as well:

>Incorrect code - **sum(t."AMOUNT" / fx.rate / power(10, cd.exponent)) AS amount**

>Correct code - **sum(t."AMOUNT" * fx.rate / power(10, cd.exponent)) AS amount** 
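
The arithmetic behind the two variants can be checked with a short Python sketch. It assumes that amounts are stored in minor units and that `rate` states how many units of the base currency one unit of the transaction currency buys; the function name and all numbers are made up for illustration.

```python
from decimal import Decimal

def to_base_currency(amount_minor, rate, exponent):
    """Convert an amount in minor units into major units of the base currency.

    Assumes `rate` is base-currency units per one unit of the transaction currency.
    """
    return Decimal(amount_minor) * Decimal(rate) / (Decimal(10) ** exponent)

# 1050 minor units with exponent 2 is 10.50 in the transaction currency.
# At an illustrative rate of 1.2 this becomes 12.60 in the base currency.
converted = to_base_currency(1050, "1.2", 2)

# Dividing by the rate, as the original query does, gives a different figure.
divided = Decimal(1050) / Decimal("1.2") / (Decimal(10) ** 2)
```

Which variant is right depends on how `fx_rates.rate` is defined; if the rate were quoted the other way round, division would be the correct operation.
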



<img src="Screenshot%202019-03-13%20at%2000.09.05.png" width="800" />

```SQL
WITH processed_users AS (
    -- First two characters of the phone country, e.g. HU
    SELECT left(u."PHONE_COUNTRY", 2) AS short_phone_country, u."ID"
    FROM users u
)
SELECT t."USER_ID",
       t."MERCHANT_COUNTRY",
       sum(t."AMOUNT" * fx."rate" / power(10, cd.exponent)) AS amount
FROM transactions t
JOIN fx_rates fx ON (fx.ccy = t."CURRENCY" AND fx.base_ccy = 'EUR')
JOIN currency_details cd ON cd.ccy = t."CURRENCY"
JOIN processed_users pu ON pu."ID" = t."USER_ID"
WHERE t."SOURCE" = 'GAIA'
  -- Truncate the three-letter merchant country so both sides have two characters
  AND pu.short_phone_country = left(t."MERCHANT_COUNTRY", 2)
GROUP BY t."USER_ID", t."MERCHANT_COUNTRY"
ORDER BY amount DESC;
```

##### Output result from query above:

<img src="Task%201%20results.png" width="800" />

# TASK 2 - Communication and SQL familiarity

#### Now it’s your turn! Write a query to identify users whose first transaction was a successful card payment over $10 USD equivalent (10 points)

### Correct SQL Query:
___

```SQL
SELECT *
FROM (
    -- DISTINCT ON keeps the first row per user in ORDER BY order,
    -- i.e. each user's earliest completed card payment
    SELECT DISTINCT ON (tr."USER_ID")
        tr."USER_ID", tr."CURRENCY", tr."AMOUNT",
        -- The join already guarantees fx.ccy = tr."CURRENCY", so no CASE is needed
        tr."AMOUNT" * fx.rate / power(10, cd.exponent) AS "AMOUNT_IN_USD",
        tr."CREATED_DATE" AS "Date_of_First_Transaction"
    FROM Public.fx_rates AS fx
    INNER JOIN transactions AS tr ON tr."CURRENCY" = fx.ccy
    JOIN currency_details cd ON cd.ccy = tr."CURRENCY"
    WHERE fx.base_ccy = 'USD'
        AND tr."TYPE" = 'CARD_PAYMENT'
        AND tr."STATE" = 'COMPLETED'
    ORDER BY tr."USER_ID", tr."CREATED_DATE" ASC
) T
WHERE "AMOUNT_IN_USD" > 10;
```

The query returns additional columns as evidence that each user's first completed card payment was above $10.

<img src="Task2.png" width="800" />

### Other solution using Python and pandas library
___

In [1]:
#importing pandas library
import pandas as pd

In [2]:
#loading all csv files using pandas
currency_details = pd.read_csv('./currency_details.csv')
fx_rates = pd.read_csv('./fx_rates.csv')
transactions = pd.read_csv('./transactions.csv',index_col=0)

In [3]:
#Merging fx_rates and currency_details tables
fx_rates_exponent = pd.merge(fx_rates, currency_details, how='inner', left_on="ccy", right_on='currency')

In [4]:
#taking ex_rate for USD vs other currencies and dropping out unused columns
rates_in_usd = fx_rates_exponent[fx_rates_exponent['base_ccy']=='USD'].drop(['currency','iso_code','is_crypto','base_ccy'],axis=1)

#Merging transactions and rates_in_usd tables
merged_trans = pd.merge(transactions, rates_in_usd, how='inner', left_on='CURRENCY', right_on='ccy')

#Creating new column "Amount in USD" and applying function Amount * ex_rate / 10**exponent
merged_trans['Amount_in_USD'] = merged_trans['AMOUNT']*merged_trans['rate']/10**merged_trans['exponent']

#Keeping only completed card payments
merged_trans = merged_trans[(merged_trans['STATE'] =="COMPLETED") & (merged_trans['TYPE'] == 'CARD_PAYMENT')]


In [None]:
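#Keeping each user's earliest completed card payment
merged_trans = merged_trans.sort_values(by = ['USER_ID','CREATED_DATE'],ascending=True ).drop_duplicates(subset = 'USER_ID', keep='first')
#Selecting the users whose earliest completed card payment is above $10
users_with_10USD_trans = merged_trans[merged_trans['Amount_in_USD']>10]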


In [None]:
#Printing result of first 5 USER_ID of customers with first successful Card transaction over $10
users_with_10USD_trans.USER_ID.head(5)

## To save the results into a csv file, use the command below

In [None]:
#Saving results into csv file
#Selecting the column as a one-column DataFrame so that the header row is written as USER_ID
users_with_10USD_trans[['USER_ID']].to_csv('./users_with_10USD_as_first_transaction.csv',index=False)
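
Both solutions above filter to completed card payments first and then take the earliest row per user. Under a stricter reading of the task, the user's very first transaction of any kind must itself be a completed card payment over $10. The sketch below contrasts the two readings on synthetic data; the column names follow the notebook, while the `TOPUP` and `DECLINED` values and all amounts are assumptions made for illustration.

```python
import pandas as pd

# Synthetic transactions, already converted to USD.
tx = pd.DataFrame({
    "USER_ID": ["a", "a", "b", "b", "c"],
    "CREATED_DATE": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-01", "2019-01-03", "2019-01-05"]
    ),
    "TYPE": ["TOPUP", "CARD_PAYMENT", "CARD_PAYMENT", "CARD_PAYMENT", "CARD_PAYMENT"],
    "STATE": ["COMPLETED", "COMPLETED", "COMPLETED", "COMPLETED", "DECLINED"],
    "Amount_in_USD": [50.0, 20.0, 15.0, 5.0, 30.0],
})

is_completed_card = (tx["TYPE"] == "CARD_PAYMENT") & (tx["STATE"] == "COMPLETED")

# Reading 1: filter first, then take the earliest completed card payment per user.
first_card = (
    tx[is_completed_card]
    .sort_values(["USER_ID", "CREATED_DATE"])
    .drop_duplicates("USER_ID", keep="first")
)
lenient_users = first_card.loc[first_card["Amount_in_USD"] > 10, "USER_ID"].tolist()

# Reading 2: take the earliest transaction of any kind, then test its type, state and amount.
first_any = tx.sort_values(["USER_ID", "CREATED_DATE"]).drop_duplicates("USER_ID", keep="first")
strict_mask = (
    (first_any["TYPE"] == "CARD_PAYMENT")
    & (first_any["STATE"] == "COMPLETED")
    & (first_any["Amount_in_USD"] > 10)
)
strict_users = first_any.loc[strict_mask, "USER_ID"].tolist()
```

In this toy data user `a` starts with a top-up, so only the first reading includes that user.
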

# TASK 3 - Fraudster Radar

#### Find 5 likely fraudsters (not already found in fraudsters.csv!), provide their user_ids, and explain how you found them and why they are likely fraudsters. Use diagrams, illustrations, etc. Show your work! (25 points)
_(Note: show your work! We are looking for data-driven techniques. If you use Excel, provide the working file. If you use Python, send us a Jupyter notebook, etc.)_

In [None]:
#importing libraries
import pandas as pd
import numpy as np
from sklearn import tree
from tqdm import tqdm
import dateutil.parser


In [None]:
#loading all csv files using pandas
currency_details = pd.read_csv('./currency_details.csv')
fx_rates = pd.read_csv('./fx_rates.csv')
transactions = pd.read_csv('./transactions.csv',index_col=0)
users = pd.read_csv('./users.csv',index_col=0)
fraudsters = pd.read_csv('./fraudsters.csv',index_col=0)
countries = pd.read_csv('./countries.csv',index_col=0)

In [None]:
#Adding to users table information about known fraudsters
users["Fraudster"] = users['ID'].isin( fraudsters['user_id'])

In [None]:
# fraudsters_details = users[users["Fraudster"]==True]

In [None]:
#Merging transactions with the users table (including the Fraudster flag)
fraudsters_trans = pd.merge(transactions, users, how='inner',left_on="USER_ID", right_on='ID') 

In [None]:
fraudsters_trans.sort_values(by = ['USER_ID','CREATED_DATE_x'],ascending=True )


In [None]:
fraudsters_trans[fraudsters_trans['USER_ID'] =="4ee8690a-ebf7-435b-9fe2-103e8f83edc6"]

#### ATM pattern
A top-up followed by withdrawals.
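
A pattern like this can be searched for across all users rather than by inspecting one account at a time. The sketch below is one possible way to do it on synthetic data; the `TYPE` labels `TOPUP` and `ATM` and the one-hour window are assumptions, not values confirmed by the dataset.

```python
import pandas as pd

# Synthetic transactions for two users.
tx = pd.DataFrame({
    "USER_ID": ["u1", "u1", "u1", "u2", "u2"],
    "CREATED_DATE": pd.to_datetime([
        "2019-01-01 10:00", "2019-01-01 10:05", "2019-01-01 10:20",
        "2019-01-01 09:00", "2019-01-04 09:00",
    ]),
    "TYPE": ["TOPUP", "ATM", "ATM", "TOPUP", "ATM"],
})

tx = tx.sort_values(["USER_ID", "CREATED_DATE"])

# Type of, and time since, the previous transaction of the same user.
tx["PREV_TYPE"] = tx.groupby("USER_ID")["TYPE"].shift()
tx["GAP"] = tx.groupby("USER_ID")["CREATED_DATE"].diff()

# Flag a withdrawal that directly follows a top-up within one hour.
quick_withdrawal = (
    (tx["TYPE"] == "ATM")
    & (tx["PREV_TYPE"] == "TOPUP")
    & (tx["GAP"] <= pd.Timedelta(hours=1))
)
flagged_users = sorted(tx.loc[quick_withdrawal, "USER_ID"].unique())
```

Comparing how often this flag fires for known fraudsters and for other users would show whether the pattern is actually discriminative.
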


In [None]:
#Converting the boolean flag to 0/1
fraudsters_trans['Fraudster'] = fraudsters_trans['Fraudster'].astype(int)


In [None]:
list_for_converting = ["CURRENCY","STATE_x", "MERCHANT_CATEGORY", "MERCHANT_COUNTRY","ENTRY_METHOD", "TYPE", "SOURCE","KYC", "BIRTH_YEAR", "COUNTRY", "STATE_y", "PHONE_COUNTRY"]

#Encoding each categorical column as integer codes
#pd.factorize is used instead of hash(), whose values for strings change between Python sessions
for conv in tqdm(list_for_converting):
    fraudsters_trans[conv] = pd.factorize(fraudsters_trans[conv])[0]
        
        

In [None]:
#Using the first two thirds of the rows for training and the rest for testing
mid_pos = round(fraudsters_trans['Fraudster'].size/1.5)
df_train = fraudsters_trans.iloc[0:mid_pos]
df_test = fraudsters_trans.iloc[mid_pos:]


In [None]:
# Create the target and features numpy arrays: target, features_one
# "Fraudster" is the target, so it is left out of the features to avoid leaking the label
feature_columns = ["CURRENCY", "AMOUNT", "STATE_x", "MERCHANT_CATEGORY", "MERCHANT_COUNTRY","ENTRY_METHOD", "TYPE", "SOURCE", "FAILED_SIGN_IN_ATTEMPTS", "KYC", "BIRTH_YEAR", "COUNTRY", "STATE_y", "PHONE_COUNTRY", "HAS_EMAIL"]
target = df_train["Fraudster"].values
features_one = df_train[feature_columns].values

In [None]:
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance of the included features and the accuracy on the training data
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

In [None]:
# Extract the same features from the test set
test_features = df_test[feature_columns].values

# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

# Create a data frame with the user, the prediction and the known label
my_solution = pd.DataFrame({"USER_ID":df_test["USER_ID"], "my_prediction":my_prediction, "Fraudster":df_test["Fraudster"]} )
print(my_solution)

# Write your solution to a csv file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv")
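
The predictions above are made per transaction, while the task asks for users. One possible way to get from one to the other is to aggregate the predictions per user and rank the users who are not yet labelled. The sketch below uses a small synthetic stand-in for `my_solution` with the same three columns; the values and the aggregate column names are made up.

```python
import pandas as pd

# Synthetic stand-in for my_solution: one row per transaction.
my_solution = pd.DataFrame({
    "USER_ID": ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "my_prediction": [1, 1, 0, 1, 1, 1, 0],
    "Fraudster": [1, 1, 0, 0, 0, 0, 0],
})

# Share of each user's transactions that the model flags, plus the known label.
per_user = my_solution.groupby("USER_ID").agg(
    flagged_share=("my_prediction", "mean"),
    n_transactions=("my_prediction", "size"),
    known_fraudster=("Fraudster", "max"),
)

# Keep only users who are not already labelled, ranked by flagged share.
candidates = (
    per_user[per_user["known_fraudster"] == 0]
    .sort_values("flagged_share", ascending=False)
    .head(5)
)
candidate_ids = candidates.index.tolist()
```

A ranking like this is only a starting point; each candidate's transactions would still need to be inspected before calling the user a likely fraudster.
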

In [None]:
# df_train["Time_diff_between_transactions"] = 
# df_train["CREATED_DATE_x"].diff()

In [None]:
df_train.columns

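In [None]:
df_train.head()

In [None]:
#Parsing one timestamp to check that CREATED_DATE_x converts to a datetime
example_date = dateutil.parser.parse(str(df_train["CREATED_DATE_x"].iloc[0]))
example_date.isoformat()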

# TASK 4 - Wide-open Analysis

#### The MLRO is asking for some information about account activity for their annual report. Explore the transactional data. Provide a histogram showing the turnover per account (inbound funds + outbound funds only). Don’t forget to handle for different currencies! (10 points)

In [None]:
#importing libraries
import pandas as pd
import numpy as np
from sklearn import tree
from tqdm import tqdm
import dateutil.parser


#loading all csv files using pandas
currency_details = pd.read_csv('./currency_details.csv')
fx_rates = pd.read_csv('./fx_rates.csv')
transactions = pd.read_csv('./transactions.csv',index_col=0)
users = pd.read_csv('./users.csv',index_col=0)
fraudsters = pd.read_csv('./fraudsters.csv',index_col=0)
countries = pd.read_csv('./countries.csv',index_col=0)

In [None]:
#Adding to users table information about known fraudsters
users["Fraudster"] = users['ID'].isin( fraudsters['user_id'])

#Merging transactions with the users table (including the Fraudster flag)
df_data = pd.merge(transactions, users, how='inner',left_on="USER_ID", right_on='ID') 

In [None]:
fx_rates = pd.merge(fx_rates,currency_details, how='inner',left_on="ccy", right_on='currency').drop(labels=['currency','iso_code','is_crypto'],axis=1)


In [None]:
#Creating new column "Amount in USD" and applying function Amount * ex_rate / 10**exponent
rates_in_usd = fx_rates[fx_rates['base_ccy']=='USD']
df_data = pd.merge(df_data,rates_in_usd, how='inner',left_on="CURRENCY", right_on='ccy')
df_data['Amount_in_USD'] = (df_data['AMOUNT']*df_data['rate']/10**df_data['exponent'])
df_data = df_data.drop(labels=['base_ccy','rate','ccy','exponent'],axis=1)

In [None]:
#Creating new column "Amount in EUR" and applying function Amount * ex_rate / 10**exponent
rates_in_eur = fx_rates[fx_rates['base_ccy']=='EUR']
df_data = pd.merge(df_data,rates_in_eur, how='inner',left_on="CURRENCY", right_on='ccy')
df_data['Amount_in_EUR'] = (df_data['AMOUNT']*df_data['rate']/10**df_data['exponent'])
df_data = df_data.drop(labels=['base_ccy','rate','ccy','exponent'],axis=1)

In [None]:
#Creating new column "Amount in GBP" and applying function Amount * ex_rate / 10**exponent
rates_in_gbp = fx_rates[fx_rates['base_ccy']=='GBP']
df_data = pd.merge(df_data,rates_in_gbp, how='inner',left_on="CURRENCY", right_on='ccy')
df_data['Amount_in_GBP'] = (df_data['AMOUNT']*df_data['rate']/10**df_data['exponent'])
df_data = df_data.drop(labels=['base_ccy','rate','ccy','exponent'],axis=1)

In [None]:
df_data = df_data.drop(labels=['TERMS_VERSION','HAS_EMAIL','ID_y','PHONE_COUNTRY','FAILED_SIGN_IN_ATTEMPTS'],axis=1)


In [None]:
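#Histogram of turnover per account in USD, so that amounts in different currencies are comparable
#Assumption: every transaction row counts as inbound or outbound funds; no TYPE or STATE filter is applied
df_data.groupby('USER_ID')['Amount_in_USD'].sum().hist(bins=50)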
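
The turnover histogram asked for above can be sketched on synthetic data as follows. Turnover per account is usually very skewed, so the sketch uses logarithmically spaced bins; the amounts, the bin range and the use of absolute values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic transactions, already converted to USD.
tx = pd.DataFrame({
    "USER_ID": ["a", "a", "b", "c", "c", "c"],
    "Amount_in_USD": [5.0, 7.0, 150.0, 2000.0, 500.0, 1.5],
})

# Turnover per account: absolute amounts are summed so that inbound and
# outbound funds both add to the total.
turnover = tx["Amount_in_USD"].abs().groupby(tx["USER_ID"]).sum()

# Logarithmically spaced bin edges from $1 to $10,000.
bin_edges = np.logspace(0, 4, num=5)
counts, _ = np.histogram(turnover, bins=bin_edges)
```

With matplotlib available, `plt.hist(turnover, bins=bin_edges)` followed by `plt.xscale('log')` would draw the same bins as a chart.
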



In [None]:
df_data.shape