# Exploratory Data Analysis

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

os.chdir('../')
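# Import only the helpers used in this notebook instead of a wildcard import
from src.utils.eda import check_df, grab_col_names, cat_summary, num_summary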

## Read dataset

In [2]:
# Forward slashes avoid invalid escape sequences (e.g. "\d", "\p") and work on every OS
df = pd.read_csv("artifacts/data_ingestion/persona_raw.csv")

## Inspect the Dataset

In [4]:
check_df(df)

##################### Shape #####################
(5000, 5)
##################### Types #####################
<class 'pandas.core.frame.DataFrame'>
Index: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   PRICE    5000 non-null   int64 
 1   SOURCE   5000 non-null   object
 2   SEX      5000 non-null   object
 3   COUNTRY  5000 non-null   object
 4   AGE      5000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 234.4+ KB
None
##################### Head #####################
   PRICE   SOURCE   SEX COUNTRY  AGE
0     39  android  male     bra   17
1     39  android  male     bra   17
2     49  android  male     bra   17
##################### Tail #####################
      PRICE   SOURCE     SEX COUNTRY  AGE
4997     29  android  female     bra   31
4998     39  android  female     bra   31
4999     29  android  female     bra   31
##################### NA #####################
PRICE      0

In [5]:
cat_cols, cat_but_car, num_cols, num_but_cat = grab_col_names(df)

Observations: 5000
Variables: 5
Categorical Cols: 3
Numerical Cols: 2
Categorical but Cardinal Cols: 0
Numerical but Categotical Cols: 0


Based on the output, we have a well-structured dataset containing 5,000 records with 5 columns. The dataset consists of two numeric variables (PRICE and AGE), and three categorical variables (SOURCE, SEX, and COUNTRY). One of the dataset's strengths is its completeness, with no missing values across any columns, making it ideal for analysis without requiring initial cleaning for null values.

PRICE ranges from \$9 to \$59, with a median of \$39. About 90% of prices fall between \$19 and \$49, which suggests a well-defined tiered pricing strategy. The age demographics reveal a predominantly young customer base: ages span 15 to 66 years, with a median of 21. Notably, 90% of customers are between 15 and 43 years old, indicating a strong appeal to younger demographics.
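The price percentiles quoted above can be double-checked without the raw file. The sketch below rebuilds a PRICE column from the frequency counts printed by `value_counts()` later in this notebook; the variable names (`price_counts`, `share_19_49`) are illustrative and not part of the project code.

```python
import numpy as np
import pandas as pd

# PRICE frequency counts as reported by value_counts() in this notebook.
price_counts = {9: 200, 19: 992, 29: 1305, 39: 1260, 49: 1031, 59: 212}

# Rebuild a PRICE column with the same distribution (row order does not matter here).
price = pd.Series(
    np.repeat(list(price_counts.keys()), list(price_counts.values())),
    name="PRICE",
)

# 5th, 50th and 95th percentiles of PRICE.
q05, q50, q95 = price.quantile([0.05, 0.50, 0.95])

# Share of sales priced between $19 and $49, both ends included.
share_19_49 = price.between(19, 49).mean()

print(f"5%: {q05:.0f}, median: {q50:.0f}, 95%: {q95:.0f}")
print(f"Share of prices in [19, 49]: {share_19_49:.2%}")
```

On these counts the 5th and 95th percentiles are \$19 and \$49, and 91.76% of sales fall inside that band, which is consistent with the "about 90%" statement.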

### Analysis of Categorical Variables

In [6]:
cat_summary(df, cat_cols, True)

         SOURCE  Ratio
SOURCE                
android    2974  59.48
ios        2026  40.52
         SEX  Ratio
SEX                
female  2621  52.42
male    2379  47.58
         COUNTRY  Ratio
COUNTRY                
usa         2065  41.30
bra         1496  29.92
deu          455   9.10
tur          451   9.02
fra          303   6.06
can          230   4.60


* Key Distributions
    - Platform: Android leads (59.48%) vs. iOS (40.52%)
    - Gender: Slightly more female users (52.42%) than male (47.58%)
    - Geographic: Two major markets
        - The US (41.30%) and Brazil (29.92%) account for 71.22% of users
        - Four other countries (Germany, Turkey, France, Canada) share the remaining 28.78%

* Business Implications
    - User concentration in the US and Brazil suggests room for expansion in the other four markets
    - The near-even gender split suggests the product appeals to both genders
    - The Android majority might influence development priorities
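As a rough illustration of how a count-and-ratio table like the one above can be produced, the sketch below starts from the COUNTRY counts in the output. It is not the project's `cat_summary` implementation, only a minimal equivalent under that assumption; on the raw frame the same counts would come from `df["COUNTRY"].value_counts()`.

```python
import pandas as pd

# COUNTRY counts copied from the cat_summary output above.
country_counts = pd.Series(
    {"usa": 2065, "bra": 1496, "deu": 455, "tur": 451, "fra": 303, "can": 230},
    name="COUNTRY",
)

# Count plus percentage share per category.
summary = pd.DataFrame({
    "COUNTRY": country_counts,
    "Ratio": 100 * country_counts / country_counts.sum(),
})

# Combined share of the two largest markets.
top_two_share = summary["Ratio"].nlargest(2).sum()

print(summary.round(2))
print(f"Top two markets: {top_two_share:.2f}% of users")
```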

### Analysis of Numerical Variables

In [7]:
num_summary(df, num_cols, True)

             PRICE          AGE
count  5000.000000  5000.000000
mean     34.132000    23.581400
std      12.464897     8.995908
min       9.000000    15.000000
5%       19.000000    15.000000
10%      19.000000    15.000000
30%      29.000000    17.000000
50%      39.000000    21.000000
70%      39.000000    26.000000
90%      49.000000    36.000000
99%      59.000000    53.000000
max      59.000000    66.000000


In [8]:
# How many sales were realized at which PRICE
print('\nFrequency of PRICE:\n', df["PRICE"].value_counts())


Frequency of PRICE:
 PRICE
29    1305
39    1260
49    1031
19     992
59     212
9      200
Name: count, dtype: int64


* Price Analysis
    - Average price is \$34.13 with a standard deviation of \$12.46
    - The median price (\$39) is higher than the mean (\$34.13), which suggests a slight skew toward lower price points

* Age Distribution
    - Mean age is 23.58 years with a standard deviation of about 9 years
    - Heavily skewed toward younger users:
        - 30% of users are 17 or younger
        - 50% are 21 or younger
        - 70% are 26 or younger
        - Only 10% of users are older than 36
    - The age range is wide (15-66 years), but users are concentrated in the younger segments

This data confirms a young-centric customer base.
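The PRICE mean, standard deviation, and median reported above follow directly from the six price points and their frequencies. The sketch below recomputes them with weighted formulas; it assumes the summary table uses the sample standard deviation (`ddof=1`, the pandas default), and the variable names are illustrative.

```python
import numpy as np

# Price points and their frequencies, from the value_counts() output above.
prices = np.array([9, 19, 29, 39, 49, 59])
counts = np.array([200, 992, 1305, 1260, 1031, 212])
n = counts.sum()

# Frequency-weighted mean.
mean_price = np.average(prices, weights=counts)

# Sample standard deviation (ddof=1), assumed to match the summary table.
std_price = np.sqrt((counts * (prices - mean_price) ** 2).sum() / (n - 1))

# Smallest price whose cumulative count reaches half of the observations.
# Here both middle observations fall in the same price bin, so this equals the median.
cumulative = np.cumsum(counts)
median_price = prices[np.searchsorted(cumulative, n / 2)]

print(f"mean={mean_price:.3f}, std={std_price:.3f}, median={median_price}")
```

With only six discrete price points, the gap between mean and median is a coarse indicator of shape, so it is worth reading alongside the frequency table itself.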

In [9]:
# Total income by COUNTRY
print('\nTotal income (USD) by COUNTRY:\n', df.groupby(["COUNTRY"])["PRICE"].sum().sort_values(ascending=False))


Total income (USD) by COUNTRY:
 COUNTRY
usa    70225
bra    51354
tur    15689
deu    15485
fra    10177
can     7730
Name: PRICE, dtype: int64


In [10]:
# PRICE average by SOURCE
print('\nAverage PRICE (USD) by SOURCE:\n', df.groupby(["SOURCE"])["PRICE"].mean().sort_values(ascending=False))


Average PRICE (USD) by SOURCE:
 SOURCE
android    34.174849
ios        34.069102
Name: PRICE, dtype: float64


In [11]:
# Average PRICE by SOURCE and COUNTRY breakdown
print('\n Average PRICE (USD) by SOURCE-COUNTRY breakdown:\n', df.groupby(["SOURCE", "COUNTRY"]).agg({"PRICE": "mean"}))


 Average PRICE (USD) by SOURCE-COUNTRY breakdown:
                      PRICE
SOURCE  COUNTRY           
android bra      34.387029
        can      33.330709
        deu      33.869888
        fra      34.312500
        tur      36.229437
        usa      33.760357
ios     bra      34.222222
        can      33.951456
        deu      34.268817
        fra      32.776224
        tur      33.272727
        usa      34.371703


The combined analysis of income and pricing across platforms and countries shows where the revenue differences come from. The USA (\$70,225) and Brazil (\$51,354) dominate total revenue, yet average prices are nearly identical across platforms: Android (\$34.17) and iOS (\$34.07). This suggests that platform choice has little influence on spending.

The SOURCE-COUNTRY breakdown tells the same story. Despite the large differences in total revenue between countries, average prices stay within a narrow band across all country-platform combinations, from \$32.78 to \$36.23. Turkey has the highest average on Android (\$36.23), and France the lowest on iOS (\$32.78). This consistency suggests that pricing is relatively standardized globally and that revenue differences are driven mainly by sales volume rather than regional pricing.

These findings indicate that the substantial revenue differences between countries are primarily due to market penetration and user base size rather than pricing differentiation or platform-specific strategies. This presents potential opportunities for growth through increased market penetration in lower-revenue markets, as users show similar spending patterns regardless of their location or platform choice.
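One way to check the volume-versus-price argument is to split each country's revenue into number of sales times average price. The sketch below uses the per-country counts and totals reported in the outputs above; the column names (`sales`, `avg_price`, `sales_share`, `revenue_share`) are illustrative choices, not names from the project.

```python
import pandas as pd

# Row counts and total income per country, copied from the outputs above.
by_country = pd.DataFrame(
    {
        "sales": [2065, 1496, 451, 455, 303, 230],
        "revenue": [70225, 51354, 15689, 15485, 10177, 7730],
    },
    index=["usa", "bra", "tur", "deu", "fra", "can"],
)

# Revenue = sales x average price, so average price is the ratio of the two.
by_country["avg_price"] = by_country["revenue"] / by_country["sales"]

# If pricing is uniform, each country's revenue share should track its sales share.
by_country["sales_share"] = by_country["sales"] / by_country["sales"].sum()
by_country["revenue_share"] = by_country["revenue"] / by_country["revenue"].sum()
share_gap = (by_country["revenue_share"] - by_country["sales_share"]).abs().max()

print(by_country.round(4))
print(f"Largest gap between revenue share and sales share: {share_gap:.4f}")
```

On these figures every country's average price sits between \$33 and \$35, and revenue share differs from sales share by less than half a percentage point, which supports the reading that volume, not price, drives the revenue gap.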