### Objectives

1. Country Spending Analysis: Analyze the data to determine if some countries spend more money in USD than others.
2. Category Popularity Analysis: Analyze which merchant categories are more popular in some countries compared to others.

### Hypotheses
1. The USA and Europe are expected to spend more money in USD than other regions.
2. The most popular merchant categories in top-spending countries are expected to differ from those in regions that spend less money.
3. Transaction counts should be higher in countries that spend more money.


In [28]:
#* MATH
%pip install pandas

#* GRAPHS
%pip install matplotlib
%pip install seaborn
%pip install plotly

#* UTILS
%pip install python-dateutil
%pip install requests
%pip install tqdm


Note: you may need to restart the kernel to use updated packages.


In [20]:

import os
import tqdm

import pandas as pd
import seaborn as sns

import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go


from typing import Final

In [21]:
path_to_dataset: Final[str] = "synthetic_fraud_data.csv" #? Raw dataset; loaded only when path_to_processed_data is None.
path_to_processed_data: Final[str | None] = "processed_data.csv" #? If provided, will be used instead of loading from the original dataset.

In [22]:
def load_dataset_with_progress(file_path: str) -> pd.DataFrame:
    """Read a CSV in chunks and show a tqdm progress bar while loading."""
    file_size = os.path.getsize(file_path)

    chunks = []
    with tqdm.tqdm(total=file_size, unit='B', unit_scale=True, desc=f"Loading {file_path}") as pbar:
        for chunk in pd.read_csv(file_path, chunksize=8192):
            # NOTE: memory_usage() measures the parsed chunk in RAM, not bytes read from disk,
            # so the bar can run past the on-disk total (file_size).
            pbar.update(chunk.memory_usage(deep=True).sum())
            chunks.append(chunk)

    return pd.concat(chunks, ignore_index=True)

# Load the raw dataset only when no processed path is configured.
if os.path.exists(path_to_dataset) and path_to_processed_data is None:
    dataset = load_dataset_with_progress(path_to_dataset)
# Otherwise the processed file must exist; there is no fallback to the raw dataset.
elif path_to_processed_data is not None and os.path.exists(path_to_processed_data):
    dataset = load_dataset_with_progress(path_to_processed_data)
else:
    raise FileNotFoundError("Either the dataset or processed data file does not exist.")

Loading processed_data.csv: 8.66GB [00:40, 215MB/s]                             
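The loader above sizes the bar by the file's on-disk bytes but advances it by each chunk's in-memory size, which is likely why the logged run ends at 8.66GB with no percentage. A minimal sketch of one alternative is to read through a binary file handle and report the handle's byte offset instead. The function name `load_csv_tracking_bytes` and the `on_progress` callback are hypothetical and not part of the original notebook.

```python
import os
from typing import Callable, Optional

import pandas as pd


def load_csv_tracking_bytes(
    file_path: str,
    chunksize: int = 8192,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> pd.DataFrame:
    """Load a CSV in chunks, reporting progress as bytes consumed from the file.

    `on_progress(bytes_done, bytes_total)` is a hypothetical hook; it is called
    after every chunk and once more when the file has been fully read.
    """
    file_size = os.path.getsize(file_path)
    chunks = []
    # Binary mode makes handle.tell() a plain byte offset, bounded by file_size.
    with open(file_path, "rb") as handle:
        for chunk in pd.read_csv(handle, chunksize=chunksize):
            chunks.append(chunk)
            if on_progress is not None:
                on_progress(handle.tell(), file_size)
    if on_progress is not None:
        on_progress(file_size, file_size)
    return pd.concat(chunks, ignore_index=True)
```

To drive a tqdm bar created with `total=file_size`, a callback such as `lambda done, total: pbar.update(done - pbar.n)` should keep the bar within its total, since the reported offset never exceeds the file size.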


In [23]:
dataset.head()

| Unnamed: 0.1 | Unnamed: 0 | transaction_id | customer_id | card_number | timestamp | merchant_category | merchant_type | merchant | amount | currency | ... | channel | device_fingerprint | ip_address | distance_from_home | high_risk_merchant | transaction_hour | weekend_transaction | velocity_last_hour | is_fraud | amount_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | TX_a0ad2a2a | CUST_72886 | 6646734767813109 | 2024-09-30 00:00:01.034820+00:00 | Restaurant | fast_food | Taco Bell | 294.87 | GBP | ... | mobile | e8e6160445c935fd0001501e4cbac8bc | 197.153.60.199 | 0 | False | 0 | False | {'num_transactions': 1197, 'total_amount': 334... | False | 375.988737 |
| 1 | 1 | TX_3599c101 | CUST_70474 | 376800864692727 | 2024-09-30 00:00:01.764464+00:00 | Entertainment | gaming | Steam | 3368.97 | BRL | ... | web | a73043a57091e775af37f252b3a32af9 | 208.123.221.203 | 1 | True | 0 | False | {'num_transactions': 509, 'total_amount': 2011... | True | 560.596608 |
| 2 | 2 | TX_a9461c6d | CUST_10715 | 5251909460951913 | 2024-09-30 00:00:02.273762+00:00 | Grocery | physical | Whole Foods | 102582.38 | JPY | ... | web | 218864e94ceaa41577d216b149722261 | 10.194.159.204 | 0 | False | 0 | False | {'num_transactions': 332, 'total_amount': 3916... | False | 683.60898 |
| 3 | 3 | TX_7be21fc4 | CUST_16193 | 376079286931183 | 2024-09-30 00:00:02.297466+00:00 | Gas | major | Exxon | 630.6 | AUD | ... | mobile | 70423fa3a1e74d01203cf93b51b9631d | 17.230.177.225 | 0 | False | 0 | False | {'num_transactions': 764, 'total_amount': 2201... | False | 403.96236 |
| 4 | 4 | TX_150f490b | CUST_87572 | 6172948052178810 | 2024-09-30 00:00:02.544063+00:00 | Healthcare | medical | Medical Center | 724949.27 | NGN | ... | web | 9880776c7b6038f2af86bd4e18a1b1a4 | 136.241.219.151 | 1 | False | 0 | False | {'num_transactions': 218, 'total_amount': 4827... | True | 454.398202 |


### The dataset covers a limited set of regions

Even though the dataset contains a limited number of regions, it still allows us to test the hypotheses, because the USA and (partially) Europe are included.
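The coverage claim above can be checked directly against the `country` column. The sketch below is illustrative only: the helper name `region_coverage`, the sample values, and the expected labels are assumptions, since the exact spelling of region names in the dataset is not shown here.

```python
import pandas as pd


def region_coverage(countries: pd.Series, expected: list[str]) -> dict[str, bool]:
    """Report, for each expected label, whether it occurs in the column."""
    present = set(countries.dropna().unique())
    return {name: name in present for name in expected}


# Hypothetical sample standing in for dataset['country'].
sample = pd.Series(["USA", "Mexico", "Brazil", "USA"], name="country")
coverage = region_coverage(sample, ["USA", "Germany"])
print(coverage)  # {'USA': True, 'Germany': False}
```

On the real data, the call would be `region_coverage(dataset['country'], [...])` with whatever labels the hypotheses depend on.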


In [24]:
transactions_per_country = dataset['country'].value_counts().reset_index()
transactions_per_country.columns = ['country', 'transaction_count']


fig = px.choropleth(transactions_per_country,
                    locations="country",
                    locationmode="country names",
                    color="transaction_count",
                    hover_name="country",
                    color_continuous_scale=px.colors.sequential.Plasma,
                    title="Number of Transactions per Country",
                    labels={'transaction_count': 'Number of Transactions'})


fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular'
    ),
    height=600,
    margin={"r":0,"t":40,"l":0,"b":0}
)

fig.show()

### Now let's look at category popularity and spending per country

This section focuses on:
1. Total Spending per Country in USD
2. Top Merchant Categories by Country
3. Average Transaction Amount per Country
4. Transaction Count by Currency
5. Distribution of Merchant Categories per Country
6. Top Currencies by Transaction Count
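Items 1 and 3, together with the per-country transaction count, can also be read side by side from one summary table. The sketch below uses pandas named aggregation on a small made-up frame; the sample rows and the output column names (`total_usd`, `avg_usd`, `transactions`) are assumptions for illustration, while `country` and `amount_usd` are the dataset's own columns.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "country": ["USA", "USA", "Mexico", "Mexico", "Mexico", "Brazil"],
    "amount_usd": [100.0, 300.0, 50.0, 150.0, 100.0, 500.0],
})

# One pass over the groups yields total, mean, and count per country.
summary = (
    sample.groupby("country")
    .agg(
        total_usd=("amount_usd", "sum"),
        avg_usd=("amount_usd", "mean"),
        transactions=("amount_usd", "count"),
    )
    .sort_values("total_usd", ascending=False)
)
print(summary)
```

Replacing `sample` with `dataset` should give the same three figures that the separate charts in this section plot one at a time.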

In [32]:

country_spending = dataset.groupby('country')['amount_usd'].sum().reset_index().sort_values(by='amount_usd', ascending=False)

fig = px.bar(country_spending, x='country', y='amount_usd',
             title='Total Spending per Country in USD',
             labels={'country': 'Country', 'amount_usd': 'Total Spending (USD)'},
             color='amount_usd', color_continuous_scale=px.colors.sequential.Viridis)
fig.update_layout(xaxis_tickangle=-45)
fig.show()


category_spending = dataset.groupby(['country', 'merchant_category'])['amount_usd'].sum().reset_index()
category_spending = category_spending.sort_values(['country', 'amount_usd'], ascending=[True, False])

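# head(len(dataset)) keeps every category for each country, so nothing is filtered here;
# pass a smaller n (for example head(5)) to keep only the top n categories per country.
top_categories = category_spending.groupby('country').head(len(dataset))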

fig = px.bar(top_categories, x='country', y='amount_usd', color='merchant_category',
             title='Top Merchant Categories by Country (Sorted by Amount Spent)',
             labels={'amount_usd': 'Total Amount (USD)', 'country': 'Country', 'merchant_category': 'Merchant Category'},
             height=600)

fig.update_layout(xaxis_tickangle=-45, barmode='stack')
fig.show()


avg_transaction = dataset.groupby('country')['amount_usd'].mean().reset_index()
fig = px.scatter(avg_transaction, x='country', y='amount_usd', size='amount_usd', color='amount_usd',
                 title='Average Transaction Amount by Country',
                 labels={'country': 'Country', 'amount_usd': 'Average Transaction Amount (USD)'},
                 color_continuous_scale=px.colors.sequential.Plasma)
fig.update_layout(xaxis_tickangle=-45)
fig.show()


currency_count = dataset['currency'].value_counts().reset_index()
currency_count.columns = ['currency', 'count']
fig = px.pie(currency_count, values='count', names='currency', title='Transaction Count by Currency')
fig.show()

heatmap_data = dataset.groupby(['country', 'merchant_category']).size().reset_index(name='count')
heatmap_pivot = heatmap_data.pivot(index='country', columns='merchant_category', values='count').fillna(0)

fig = px.imshow(heatmap_pivot,
                labels=dict(x="Merchant Category", y="Country", color="Transaction Count"),
                x=heatmap_pivot.columns,
                y=heatmap_pivot.index,
                aspect="auto",
                title="Heatmap of Transaction Counts by Country and Merchant Category")
fig.update_xaxes(side="top")
fig.show()

# Same per-currency counts as the pie chart above, shown as a bar chart.
currency_counts = dataset['currency'].value_counts()

fig = go.Figure(data=[go.Bar(x=currency_counts.index, y=currency_counts.values)])
fig.update_layout(
    title='Top Currencies by Transaction Count',
    xaxis_title='Currency',
    yaxis_title='Number of Transactions',
    height=400
)
fig.show()
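Raw counts in the heatmap are dominated by countries with more transactions, which makes category popularity hard to compare across countries. One option, sketched here on made-up rows, is to normalize each country's row to shares with `pd.crosstab(..., normalize="index")`. The sample data is hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA", "Mexico", "Mexico", "Mexico"],
    "merchant_category": [
        "Grocery", "Grocery", "Gas", "Restaurant", "Gas", "Gas", "Grocery",
    ],
})

# Each row sums to 1, so values are the share of a country's transactions.
shares = pd.crosstab(
    sample["country"], sample["merchant_category"], normalize="index"
)

# Most popular category per country, independent of transaction volume.
top_category = shares.idxmax(axis=1)
print(shares.round(2))
print(top_category)
```

The `shares` frame can be passed to `px.imshow` in place of `heatmap_pivot` if a volume-independent view is wanted.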

### To sum things up

#### Objectives
1. Country Spending Analysis: Analyze the data to determine if some countries spend more money in USD than others. (Completed ✅)
2. Category Popularity Analysis: Analyze which merchant categories are more popular in some countries compared to others. (Completed ✅)

#### Hypotheses
1. The USA and Europe are expected to spend more money in USD than other regions.

    Hypothesis incorrect. The top-spending countries were Mexico, Brazil, and Russia, in that order.

2. The most popular merchant categories in top-spending countries are expected to differ from those in regions that spend less money.

    Hypothesis incorrect. The most popular merchant categories in top-spending countries were the same as in regions that spent less money. However, the amount of money spent on each category was larger.

3. Transaction count should be higher in countries that spend more money.
    
    Hypothesis incorrect. The highest transaction counts came from the entries that cover the most regions, such as Europe, rather than from the top-spending countries. Note, however, that Europe was originally expected to be among the top spenders.
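Hypothesis 3 was judged from the charts. A numeric cross-check is sketched below: rank countries by total spending and by transaction count, then correlate the two rankings (a Spearman-style rank correlation). The helper name `spending_vs_count` and the sample rows are hypothetical, and this is not the method used in the notebook.

```python
import pandas as pd


def spending_vs_count(df: pd.DataFrame) -> float:
    """Rank correlation between total spending and transaction count per country."""
    per_country = df.groupby("country")["amount_usd"].agg(
        total_usd="sum", transactions="count"
    )
    # Pearson correlation of the ranks is the Spearman rank correlation.
    return per_country["total_usd"].rank().corr(per_country["transactions"].rank())


# Hypothetical rows: the biggest spender (A) has the fewest transactions.
sample = pd.DataFrame({
    "country": ["A", "B", "B", "C", "C", "C"],
    "amount_usd": [900.0, 200.0, 200.0, 50.0, 50.0, 50.0],
})
rho = spending_vs_count(sample)
print(rho)
```

A value near 1 would support the hypothesis, while a value near 0 or below would not. In this made-up sample the two rankings are exactly reversed, so the result is close to -1.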