# Exploratory Data Analysis of Purchase Data

## Trends
1. The majority of players are male.
2. The majority of purchases are made by players aged 15 to 24.
3. Most items are purchased only once according to the data file. This might not hold if the data is just a sample.
4. The player age distribution is close to normal but slightly right-skewed.
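
The skew claim in item 4 can be checked numerically. `Series.skew()` returns the sample skewness, and a positive value indicates a longer right tail. The sketch below uses made-up ages, not the notebook's data; on the real data one would presumably want one age per player (for example after dropping duplicate `SN` values), which is an assumption about how the check should be set up.

```python
import pandas as pd

# Hypothetical ages with a few older players forming a right tail.
ages = pd.Series([15, 18, 20, 21, 22, 22, 23, 24, 25, 30, 38, 45])

# Sample skewness: > 0 means right-skewed, < 0 means left-skewed.
skewness = ages.skew()
print('skewness: {:.2f}'.format(skewness))
```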

In [183]:
import os
from pandas import Series, DataFrame
import pandas as pd
import matplotlib.pyplot as plt

MONEY_FORMAT = '${:,.2f}'
PERC_FORMAT = '{:.2%}'
CNT_FORMAT = '{:,}'
FORMAT = {
    'Average Purchase Price': MONEY_FORMAT,
    'Total Purchase Value': MONEY_FORMAT,
    'Percenage of Players': '{:.2%}',
    'Normalized Total':  MONEY_FORMAT,
    'Average Price': MONEY_FORMAT,
    'Total Revenue': MONEY_FORMAT,
    'Total Players': CNT_FORMAT,
    'Number of Purchases': CNT_FORMAT
}
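
The templates above are ordinary `str.format` strings, which `Styler.format` accepts per column. A minimal sketch of what each one produces, using sample values chosen only for illustration:

```python
MONEY_FORMAT = '${:,.2f}'
PERC_FORMAT = '{:.2%}'
CNT_FORMAT = '{:,}'

money = MONEY_FORMAT.format(2286.33)  # thousands separator, two decimals
perc = PERC_FORMAT.format(0.1745)     # fraction rendered as a percentage
count = CNT_FORMAT.format(1234567)    # integer with thousands separators

print(money, perc, count)  # $2,286.33 17.45% 1,234,567
```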

## Load to DataFrame

In [184]:
DATA_PATH = '.'
file = input("What's the input file?")
full_path = os.path.join(DATA_PATH, file)
df_purchase = pd.read_json(full_path)

What's the input file?purchase_data.json
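
The rest of the notebook relies on the columns `SN`, `Age`, `Gender`, `Item ID`, `Item Name` and `Price`. A minimal sketch of a record layout that `pd.read_json` would load into that shape; the two records are made up, and the real file may be organized differently or carry extra fields.

```python
import io
import pandas as pd

# Two hypothetical purchase records with the columns used in this notebook.
sample_json = io.StringIO('''[
  {"SN": "PlayerA", "Age": 21, "Gender": "Male",
   "Item ID": 1, "Item Name": "Sword", "Price": 2.5},
  {"SN": "PlayerB", "Age": 34, "Gender": "Female",
   "Item ID": 2, "Item Name": "Shield", "Price": 3.0}
]''')

# A list of records becomes one row per record, one column per key.
df_sample = pd.read_json(sample_json)
print(df_sample)
```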


## Player Count

The `SN` field appears to uniquely identify a player, so the player count is the number of unique `SN` values.

In [197]:
DataFrame( 
    [ 
        {
            'Total Players': len(df_purchase['SN'].unique()),
        }
    ]
)

|  | Total Players |
| --- | --- |
| 0 | 573 |
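
As a side note, `Series.nunique()` gives the same count as `len(series.unique())` when there are no missing values. A small sketch on a toy frame (the screen names are invented):

```python
import pandas as pd

# Five purchases made by three distinct (hypothetical) players.
df_toy = pd.DataFrame({'SN': ['a', 'b', 'a', 'c', 'b']})

n_via_len = len(df_toy['SN'].unique())
n_via_nunique = df_toy['SN'].nunique()

print(n_via_len, n_via_nunique)  # 3 3
```

The two differ only when the column contains missing values, because `nunique()` skips them by default while `unique()` includes them.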


## Purchasing Analysis (Total)

In [198]:
DataFrame(
    columns = [
       'Number of Unique Items',
       'Average Price',
       'Number of Purchases',
       'Total Revenue',
    ],
    data = [
       [len(df_purchase['Item ID'].unique()),
        df_purchase.Price.mean(),
        df_purchase.Price.count(),
        df_purchase.Price.sum(),
       ],
    ], 
).style.format(FORMAT)

|  | Number of Unique Items | Average Price | Number of Purchases | Total Revenue |
| --- | --- | --- | --- | --- |
| 0 | 183 | $2.93 | 780 | $2,286.33 |


## Gender Demographics

In [199]:
def report_demographics(df, groupby):
    '''
    Report on demographics summary
    
    :param df: DataFrame input
    :param groupby: Summarized by field
    :return: Formatted output
    '''
    return DataFrame(
        df.groupby(groupby)['SN'].unique()
    ).assign(
        cnt = lambda x: x['SN'].map(len)
    ).assign(
        perc=lambda x: x['cnt'] / len(df['SN'].unique())
    ).drop(
        'SN', axis=1
    ).rename(
        columns = {
            'cnt': 'Toatl Count',
            'perc': 'Percenage of Players'
        }
    ).style.format(FORMAT)

In [200]:
report_demographics(df_purchase, 'Gender')

| Gender | Toatl Count | Percenage of Players |
| --- | --- | --- |
| Female | 100 | 17.45% |
| Male | 465 | 81.15% |
| Other / Non-Disclosed | 8 | 1.40% |


## Purchasing Analysis (Gender)
**What is 'Normalized Total'?**

'Normalized Total' is an odd name. In the sample chart it is mostly the same as the average purchase price, but sometimes slightly higher, and the higher values occur where some players purchased multiple items. My guess is that it is actually the total purchase value per unique player.
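
This reading can be checked with the figures reported in this notebook's gender tables: dividing each group's total purchase value by its number of unique players. The sketch below only redoes that arithmetic with numbers copied from those tables; it does not touch the data file.

```python
# Figures copied from the gender tables in this notebook.
total_value = {'Female': 382.91, 'Male': 1867.68, 'Other / Non-Disclosed': 35.74}
unique_players = {'Female': 100, 'Male': 465, 'Other / Non-Disclosed': 8}
avg_price = {'Female': 2.82, 'Male': 2.95, 'Other / Non-Disclosed': 3.25}

# Total purchase value per unique player, by gender.
per_player = {g: total_value[g] / unique_players[g] for g in total_value}

for gender, value in per_player.items():
    print('{}: {:.2f} per player vs {:.2f} per purchase'.format(
        gender, value, avg_price[gender]))
```

The per-player value is higher than the per-purchase average in every group, which is what one would expect when some players buy more than once.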

In [201]:
def report_analysis(df, groupby):
    '''
     Generate report of price statistics
    
    :param df: DataFrame input
    :param groupby: Summarized by field
    :return: Formatted output
    '''
    count_unique = lambda x: len(x.unique())
    count_unique.__name__ = 'count_unique'
    # Select several columns with a list; recent pandas versions reject a bare tuple here
    df_summary = df.groupby(groupby)[['Price', 'SN']].agg(
        {
            'Price': ['count', 'mean', 'sum'],
            'SN': count_unique
        }
    )
 
    # Flatten the multi-level column index for easier selection
    df_summary.columns = ['_'.join(x) for x in df_summary.columns]
    
    return df_summary.assign(
        normalized_total=lambda x: x['Price_sum']/x['SN_count_unique']
    ).drop(
        'SN_count_unique', axis=1
    ).rename(
        columns = {
            'Price_count': 'Purchase Count',
            'Price_mean': 'Average Purchase Price',
            'Price_sum': 'Total Purchase Value',
            'normalized_total': 'Normalized Total',
        }
    ).style.format(FORMAT)

In [202]:
report_analysis(df_purchase, 'Gender')

| Gender | Purchase Count | Average Purchase Price | Total Purchase Value | Normalized Total |
| --- | --- | --- | --- | --- |
| Female | 136 | $2.82 | $382.91 | $3.83 |
| Male | 633 | $2.95 | $1,867.68 | $4.02 |
| Other / Non-Disclosed | 11 | $3.25 | $35.74 | $4.47 |
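
An alternative to aggregating with a dict and then flattening the column index is pandas' named aggregation, which yields flat column names directly. This is a sketch on toy data, not a replacement tested against the purchase file; the output column names are chosen here for illustration.

```python
import pandas as pd

# Toy purchases: player 'a' buys twice, so Male has 3 purchases by 2 players.
df_toy = pd.DataFrame({
    'SN': ['a', 'a', 'b', 'c'],
    'Gender': ['Male', 'Male', 'Male', 'Female'],
    'Price': [2.0, 4.0, 3.0, 5.0],
})

# Each keyword becomes a flat output column: name=(source column, function).
summary = df_toy.groupby('Gender').agg(
    purchase_count=('Price', 'count'),
    avg_price=('Price', 'mean'),
    total_value=('Price', 'sum'),
    players=('SN', 'nunique'),
)

# Total purchase value per unique player.
summary['normalized_total'] = summary['total_value'] / summary['players']
print(summary)
```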


## Age Demographics

The sample PDF shows both Age Demographics and Purchasing Analysis (Age), so I did both.

In [203]:
bin_start = 10
bins = [0]
max_age = df_purchase.Age.max() 
bins.extend(range(bin_start, max_age, 5))
bins.append(max_age + 1)

labels = ['{} to {}'.format(x,y-1) for x,y in zip(bins, bins[1:])]

df_purchase.loc[:, 'Age Range'] = pd.cut(df_purchase['Age'], bins=bins, labels=labels)

report_demographics(df_purchase, 'Age Range')

| Age Range | Toatl Count | Percenage of Players |
| --- | --- | --- |
| 0 to 9 | 22 | 3.84% |
| 10 to 14 | 54 | 9.42% |
| 15 to 19 | 139 | 24.26% |
| 20 to 24 | 234 | 40.84% |
| 25 to 29 | 52 | 9.08% |
| 30 to 34 | 44 | 7.68% |
| 35 to 39 | 25 | 4.36% |
| 40 to 45 | 3 | 0.52% |
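
One detail of the binning is worth checking. The `pd.cut` call above does not pass `right`, so the default right-closed intervals apply: the first bin is (0, 10], and an age of exactly 10 is assigned the label "0 to 9". Labels of the form "10 to 14" match the bin edges only when `right=False` is passed. The sketch below compares the two on a few made-up ages, assuming the same bin edges as this notebook.

```python
import pandas as pd

# Hypothetical ages sitting on or next to bin edges.
ages = pd.Series([9, 10, 14, 15, 40, 45])
bins = [0, 10, 15, 20, 25, 30, 35, 40, 46]
labels = ['{} to {}'.format(lo, hi - 1) for lo, hi in zip(bins, bins[1:])]

# Default: intervals are (lo, hi], so 10 falls into the first bin.
default_cut = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [lo, hi), which is what the labels describe.
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'age': ages, 'default': default_cut, 'right=False': left_closed}))
```

Rerunning the age tables with `right=False` would shift players whose age is exactly on a boundary (10, 15, 20, ...) into the next range, so the counts would differ from those shown here.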


## Purchasing Analysis (Age)

In [204]:
report_analysis(df_purchase, 'Age Range')

| Age Range | Purchase Count | Average Purchase Price | Total Purchase Value | Normalized Total |
| --- | --- | --- | --- | --- |
| 0 to 9 | 32 | $3.02 | $96.62 | $4.39 |
| 10 to 14 | 78 | $2.87 | $224.15 | $4.15 |
| 15 to 19 | 184 | $2.87 | $528.74 | $3.80 |
| 20 to 24 | 305 | $2.96 | $902.61 | $3.86 |
| 25 to 29 | 76 | $2.89 | $219.82 | $4.23 |
| 30 to 34 | 58 | $3.07 | $178.26 | $4.05 |
| 35 to 39 | 44 | $2.90 | $127.49 | $5.10 |
| 40 to 45 | 3 | $2.88 | $8.64 | $2.88 |


## Top Spenders

Dense ranking is used here because purchase counts are integers and ties are likely: tied rows share a rank, and no rank is skipped after a tie.
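
A small sketch of how `method='dense'` differs from `method='min'` on tied values; the counts are made up. With dense ranking, a filter such as `Rank <= 5` can return more than five rows when ties occur.

```python
import pandas as pd

# Hypothetical purchase counts with ties.
counts = pd.Series([11, 11, 9, 9, 8], index=['a', 'b', 'c', 'd', 'e'])

# Dense: ties share a rank and the next rank follows immediately.
dense = counts.rank(method='dense', ascending=False)

# Min: ties share the lowest rank and the following ranks are skipped.
minimum = counts.rank(method='min', ascending=False)

print(pd.DataFrame({'count': counts, 'dense': dense, 'min': minimum}))
```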

In [279]:
def report_top(df, groupby, by='sum', top=5):
    '''
    Report on top items. Default to 'sum' statistics, and top 5.
    
    :param df: DataFrame input
    :param groupby: Summarized by field
    :param by: Summary statistics
    :param top: How many top items
    :return: Formatted output
    '''
    df_summary = df.groupby(groupby)['Price'].agg(
        ['count', 'mean', 'sum']
    )
    
    df_summary['Rank'] = df_summary[by].rank(method='dense', ascending=False)
    
    return df_summary.loc[df_summary['Rank'] <= top].sort_values(
        by=by, 
        ascending=False
    ).rename(
        columns={
            'count': 'Purchase Count',
            'mean': 'Average Purchase Price',
            'sum': 'Total Purchase Value'
        }
    ).style.format(FORMAT)

In [280]:
report_top(df_purchase, 'SN')

| SN | Purchase Count | Average Purchase Price | Total Purchase Value | Rank |
| --- | --- | --- | --- | --- |
| Undirrala66 | 5 | $3.41 | $17.06 | 1 |
| Saedue76 | 4 | $3.39 | $13.56 | 2 |
| Mindimnya67 | 4 | $3.18 | $12.74 | 3 |
| Haellysu29 | 3 | $4.24 | $12.73 | 4 |
| Eoda93 | 3 | $3.86 | $11.58 | 5 |


## Most Popular Items

In [281]:
report_top(df_purchase, ['Item ID', 'Item Name'], by='count')

| Item ID | Item Name | Purchase Count | Average Purchase Price | Total Purchase Value | Rank |
| --- | --- | --- | --- | --- | --- |
| 84 | Arcane Gem | 11 | $2.23 | $24.53 | 1 |
| 39 | Betrayal, Whisper of Grieving Widows | 11 | $2.35 | $25.85 | 1 |
| 175 | Woeful Adamantite Claymore | 9 | $1.24 | $11.16 | 2 |
| 13 | Serenity | 9 | $1.49 | $13.41 | 2 |
| 31 | Trickster | 9 | $2.07 | $18.63 | 2 |
| 34 | Retribution Axe | 9 | $4.14 | $37.26 | 2 |
| 65 | Conqueror Adamantite Mace | 8 | $1.96 | $15.68 | 3 |
| 107 | Splitter, Foe Of Subtlety | 8 | $3.61 | $28.88 | 3 |
| 106 | Crying Steel Sickle | 8 | $2.29 | $18.32 | 3 |
| 92 | Final Critic | 8 | $1.36 | $10.88 | 3 |


## Most Profitable Items

In [282]:
report_top(df_purchase, ['Item ID', 'Item Name'])

| Item ID | Item Name | Purchase Count | Average Purchase Price | Total Purchase Value | Rank |
| --- | --- | --- | --- | --- | --- |
| 34 | Retribution Axe | 9 | $4.14 | $37.26 | 1 |
| 115 | Spectral Diamond Doomblade | 7 | $4.25 | $29.75 | 2 |
| 32 | Orenmir | 6 | $4.95 | $29.70 | 3 |
| 103 | Singed Scalpel | 6 | $4.87 | $29.22 | 4 |
| 107 | Splitter, Foe Of Subtlety | 8 | $3.61 | $28.88 | 5 |
