# Mobile app traffic analysis
*by Anna Rezyapova*

### Task description
Imagine a mobile application that receives traffic from different sources. This traffic is not homogeneous, and we need to understand it better in order to optimize marketing activity.

Your task is to analyze the dataset to find useful insights: which segments of the data bring the highest-quality traffic. The main metric is the conversion rate from lead to client; other metrics can be used as well.


You can do this by answering the following questions:

- How are users distributed over countries?
- How many outliers are there in the data (in terms of deposits)?
- Find the segments with the best conversion rate (client/lead ratio) and explain why you consider them the best ones
- Visualize the distribution of deposits over sources and channels
- What advice would you give the marketing team to optimize their activity?

To do this, you have a synthetic dataset that contains the history of users' activities (registrations and deposits).

Data description:

- client_id - unique id of a lead/client. It is assigned during registration and never changes afterwards.
- Country - country of the lead/client (iso2). It is stored in a separate file (countries.csv).
- Source - source of traffic acquisition. There are two possible sources (posts and telegram channel). If Source contains "postid", the lead came from an article; the id of the post doesn't matter. If Source contains "telegram", the lead came from telegram. In the data file this value appears in the segment column.
- channel - channel of traffic. For example, a user can come from the 'telegram' source and the 'affiliate' channel.
- Clicks - number of clicks the user made during the first day after registration.
- Latency - application loading time in milliseconds.
- Depo - amount of deposit, USD.



The expected result is a Jupyter notebook with Python code that answers the questions above.

Glossary:

- Lead: a user who registered inside the mobile app
- Client: a user who registered inside the mobile app AND made a deposit


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

%matplotlib inline

In [None]:
data_path = '/content/drive/MyDrive/Colab Notebooks/synthetic_data.csv'
countries_path = '/content/drive/MyDrive/Colab Notebooks/countries.csv'

In [None]:
df = pd.read_csv(data_path)

In [None]:
df.head()

|   | Unnamed: 0 | depo | segment | channel | clicks | latency | client_id |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | postid_4057 | smm | 1.0 | 2.649725 | 1442498 |
| 1 | 1 | 0 | telegram | affiliate | 10.0 | 2.610846 | 7865631 |
| 2 | 2 | 0 | postid_8542 | facebook | 13.0 | 3.001162 | 8165584 |
| 3 | 3 | 0 | telegram | direct | 0.0 | 1.788369 | 5893056 |
| 4 | 4 | 0 | telegram | smm | 0.0 | 1.932069 | 3780924 |


In [None]:
countr_df = pd.read_csv(countries_path)

In [None]:
countr_df.head()

|   | country | client_id |
|---|---|---|
| 0 | IN | 6348826 |
| 1 | FR | 6751691 |
| 2 | DE | 8638448 |
| 3 | LT | 4722696 |
| 4 | ES | 2411132 |


### How are users distributed over countries?

In [None]:
by_countries = countr_df.groupby('country', as_index=False)\
        .agg({'client_id':'count'})\
        .rename(columns = {'client_id':'num_clients'})

In [None]:
by_countries

|   | country | num_clients |
|---|---|---|
| 0 | DE | 20202 |
| 1 | ES | 20130 |
| 2 | FR | 19946 |
| 3 | IN | 19844 |
| 4 | IS | 20013 |
| 5 | IT | 19833 |
| 6 | LT | 19917 |
| 7 | LU | 19567 |
| 8 | MO | 19864 |
| 9 | US | 20105 |


In [None]:
# Bar chart of the number of users registered in each country
fig = px.bar(by_countries, x='country', y='num_clients',
             labels={"num_clients": "Number of clients"},
             title='Number of clients by countries')
fig.show()

As we can see, users are distributed over countries almost uniformly: each country has roughly 20K users, ranging from 19,567 (LU) to 20,202 (DE).
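
The "almost uniform" reading can also be checked numerically by comparing each country's share with the share it would have under a perfectly even split. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for countries.csv, not the real data):

```python
import pandas as pd

# Hypothetical toy data standing in for countries.csv
toy = pd.DataFrame({
    "country": ["DE", "DE", "ES", "ES", "FR", "FR", "FR", "IN"],
    "client_id": range(8),
})

# Share of users per country (sums to 1)
shares = toy["country"].value_counts(normalize=True).sort_index()

# Under a perfectly uniform split every country would hold 1 / n_countries
uniform_share = 1 / shares.size
max_deviation = (shares - uniform_share).abs().max()

print(shares.round(3))
print(f"largest deviation from a uniform share: {max_deviation:.3f}")
```

On the real data the same idea would be applied to `countr_df['country']`; a small maximum deviation supports the uniformity claim.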

### How many outliers are there in the data (in terms of deposits)?

To start with, let us check the outliers with the box plot and histogram of deposits (only non-zero deposits are included):

In [None]:
fig = px.box(df.query('depo!=0'), y="depo")
fig.show()

In [None]:
fig = px.histogram(df.query('depo!=0'), x="depo")
fig.show()

The box plot shows several high outliers, starting at depo = 5615 and reaching a maximum of around 31K. There are also negative deposits (depo < 0), which are treated as outliers as well.

In [None]:
df.query('depo < 0 or depo >= 5615').shape

(208, 7)

In total, 208 rows are flagged as outliers (negative deposits or deposits of 5615 and above). The box plot and histogram of the deposits after removing them:

In [None]:
fig = px.box(df.query('depo > 0 & depo < 5615'), y="depo")
fig.show()

In [None]:
fig = px.histogram(df.query('depo > 0 & depo < 5615'), x="depo")
fig.show()

In [None]:
df_wo_out = df.query('depo > 0 & depo < 5615')
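
The 5615 threshold above was read off the box plot. For reference, box-plot whiskers are commonly based on Tukey's rule: points beyond 1.5 × IQR from the quartiles are flagged. The sketch below computes such fences on made-up numbers; `tukey_fences` and `toy_depo` are hypothetical names, and the plotting library may compute quartiles slightly differently, so the fence is not guaranteed to match the value shown in the plot.

```python
import pandas as pd


def tukey_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical deposits, including one negative and one very large value
toy_depo = pd.Series([100, 150, 200, 250, 300, 350, 400, 5000, -50])

# As in the notebook, fences are computed on positive deposits only
lower, upper = tukey_fences(toy_depo[toy_depo > 0])

# Negative deposits and values above the upper fence count as outliers
is_outlier = (toy_depo < 0) | (toy_depo > upper)
n_outliers = int(is_outlier.sum())

print(f"upper fence: {upper}, outliers: {n_outliers}")
```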

### Find segments with the best conversion rate (client/lead ratio) and explain why you consider them the best ones

In [None]:
df.head()

|   | Unnamed: 0 | depo | segment | channel | clicks | latency | client_id |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | postid_4057 | smm | 1.0 | 2.649725 | 1442498 |
| 1 | 1 | 0 | telegram | affiliate | 10.0 | 2.610846 | 7865631 |
| 2 | 2 | 0 | postid_8542 | facebook | 13.0 | 3.001162 | 8165584 |
| 3 | 3 | 0 | telegram | direct | 0.0 | 1.788369 | 5893056 |
| 4 | 4 | 0 | telegram | smm | 0.0 | 1.932069 | 3780924 |


In [None]:
df['segment1'] = df['segment'] 

In [None]:
df.loc[df.segment.str.contains('postid'), 'segment1'] = 'post'

In [None]:
segm1 = df.groupby('segment1', as_index = False)\
        .agg({'latency':'count'})\
        .rename(columns = {'latency' : 'leads'})

In [None]:
segm1_cl = df.query('depo != 0').groupby('segment1', as_index = False)\
            .agg({'latency':'count'})\
            .rename(columns = {'latency' : 'clients'})

In [None]:
segm1 = segm1.merge(segm1_cl, on = 'segment1')

In [None]:
segm1['conversion'] = np.around(segm1['clients'] / segm1['leads'] * 100,\
                                decimals = 2)

In [None]:
segm1 = segm1.sort_values('conversion', ascending = False)

In [None]:
segm1

|   | segment1 | leads | clients | conversion |
|---|---|---|---|---|
| 1 | telegram | 61754 | 4038 | 6.54 |
| 0 | post | 157560 | 6060 | 3.85 |


As the table above shows, the conversion from lead to client is higher for the **telegram** segment (6.54%) than for posts (3.85%).
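
The leads, clients and conversion columns can also be produced in a single grouped aggregation, which avoids the separate merge step. This is a minimal sketch on made-up rows (`toy`, `source` and `is_client` are hypothetical names). Note one deliberate difference, flagged as an assumption: it counts only positive deposits as clients, whereas the cells above use `depo != 0`, which also counts the negative deposits identified as outliers.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the dataset
toy = pd.DataFrame({
    "segment": ["telegram", "postid_1", "postid_2", "telegram", "postid_3", "telegram"],
    "depo": [120, 0, 0, 0, 300, 50],
})

# Collapse all "postid_*" values into a single "post" source
toy["source"] = toy["segment"].where(~toy["segment"].str.contains("postid"), "post")

# Assumption: a client is a lead with a strictly positive deposit
toy["is_client"] = toy["depo"] > 0

conv = (
    toy.groupby("source")["is_client"]
       .agg(leads="size", clients="sum")  # size = all leads, sum = number of True values
       .assign(conversion=lambda t: (t["clients"] / t["leads"] * 100).round(2))
       .sort_values("conversion", ascending=False)
)
print(conv)
```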

In [None]:
by_channel = df.groupby('channel', as_index = False)\
              .agg({'latency':'count'})\
              .rename(columns = {'latency' : 'leads'})

In [None]:
by_channel_cl = df.query('depo != 0')\
              .groupby('channel', as_index = False)\
              .agg({'latency':'count'})\
              .rename(columns = {'latency' : 'clients'})

In [None]:
by_channel = by_channel.merge(by_channel_cl, on = 'channel')

In [None]:
by_channel['conversion'] = np.around(by_channel['clients'] / by_channel['leads'] * 100, decimals = 2)

In [None]:
by_channel = by_channel.sort_values('conversion', ascending = False)

In [None]:
by_channel

|   | channel | leads | clients | conversion |
|---|---|---|---|---|
| 3 | smm | 65805 | 3660 | 5.56 |
| 2 | facebook | 76516 | 4008 | 5.24 |
| 0 | affiliate | 10857 | 536 | 4.94 |
| 1 | direct | 21714 | 1007 | 4.64 |
| 4 | social media | 42250 | 793 | 1.88 |


Across channels, **smm** and **facebook** have the highest conversion rates (5.56% and 5.24%, respectively), while the social media channel has the lowest (1.88%).
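
Two points are worth a closer look. First, the leads in the channel table add up to 217,142, while the leads in the source table add up to 219,314; `groupby` drops rows whose key is missing, so leads without a channel are silently excluded here. Second, source and channel can be combined to look for finer segments. The sketch below illustrates both on made-up rows (`toy`, `source`, `is_client` and the "unknown" label are hypothetical, and only positive deposits are counted as clients, which is an assumption):

```python
import pandas as pd

# Hypothetical toy rows; None marks a missing channel
toy = pd.DataFrame({
    "source":  ["telegram", "telegram", "post", "post", "post", "telegram", "post", "post"],
    "channel": ["smm", "smm", "smm", "facebook", "facebook", None, "facebook", None],
    "depo":    [100, 0, 0, 250, 0, 80, 0, 0],
})

# Keep leads with a missing channel visible instead of dropping them
toy["channel"] = toy["channel"].fillna("unknown")

# Assumption: a client is a lead with a strictly positive deposit
toy["is_client"] = toy["depo"] > 0

# Conversion for every source x channel combination
cross = (
    toy.groupby(["source", "channel"])["is_client"]
       .agg(leads="size", clients="sum")
       .assign(conversion=lambda t: (t["clients"] / t["leads"] * 100).round(2))
       .reset_index()
       .sort_values("conversion", ascending=False)
)
print(cross)
```

Small cells should be read with care: a combination with only a handful of leads can show an extreme conversion rate by chance.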

### Visualize deposits distribution over sources and channels

In [None]:
fig = px.histogram(df.query('depo > 0 & depo < 5615'), x="depo", color="segment1")
fig.show()

In [None]:
fig = px.box(df.query('depo > 0 & depo < 5615'), x="segment1", y="depo")
fig.show()

The distributions of the two segments show that deposits from posts are typically higher: the median deposit from posts is 397, while the median deposit from telegram is 204.


Rows with missing values in the channel column were removed before plotting the distributions of deposits by channel.

In [None]:
fig = px.histogram(df[df['channel'].notna()].query('depo > 0 & depo < 5615'),
                   x="depo", color="channel")
fig.show()

In [None]:
fig = px.box(df[df['channel'].notna()].query('depo > 0 & depo < 5615'), x="channel", y="depo")
fig.show()

Deposit distributions differ less across channels than across segments. For example, the social media channel has the lowest median deposit (208.5), while facebook has the highest (366). We can also see that deposits from the smm and direct channels are more uniformly distributed.
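
The medians quoted above were read from the box plots. They can also be computed directly with a grouped aggregation, which makes them easier to reuse. A minimal sketch on made-up rows follows (`toy`, `clean` and `summary` are hypothetical names; the 5615 cut-off is the one used in the cells above):

```python
import pandas as pd

# Hypothetical toy rows; None marks a missing channel
toy = pd.DataFrame({
    "channel": ["smm", "smm", "smm", "facebook", "facebook", None, "direct"],
    "depo":    [100, 300, 0, 400, 200, 500, 9000],
})

UPPER = 5615  # outlier threshold taken from the box plot earlier in the notebook

# Keep positive deposits below the threshold, as in the plots
clean = toy[(toy["depo"] > 0) & (toy["depo"] < UPPER)]

# groupby drops rows with a missing channel by default
summary = clean.groupby("channel")["depo"].agg(["count", "median", "mean"])
print(summary)
```

Grouping by `segment1` instead of `channel` would give the per-source medians in the same way.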

### What advice would you give the marketing team to optimize their activity?

*   Traffic from the telegram segment has the highest conversion; however, its median deposit is almost 2 times lower than the median deposit from posts. To optimize the marketing campaign, either the deposit size for the telegram segment should be increased or the conversion rate from posts should be raised.
*   In terms of channels, smm has the highest conversion, but its median deposit is not the largest, so the deposit size for the smm channel should be raised. Facebook has the highest median deposit, so raising its conversion would bring more profit.
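
One way to weigh conversion against deposit size is a rough value-per-lead figure: conversion rate multiplied by a typical deposit. The sketch below uses the conversion rates and median deposits reported above. It is only a proxy under a stated assumption: the median is not the mean, so this does not equal actual revenue per lead, and `value_per_lead` is a hypothetical name.

```python
# Figures reported earlier in the notebook for the two sources
segments = {
    "telegram": {"conversion_pct": 6.54, "median_depo": 204},
    "post": {"conversion_pct": 3.85, "median_depo": 397},
}

# Rough proxy: share of leads that convert x typical (median) deposit, in USD
value_per_lead = {
    name: round(s["conversion_pct"] / 100 * s["median_depo"], 2)
    for name, s in segments.items()
}

for name, value in sorted(value_per_lead.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ~{value} USD per lead")
```

By this proxy the two sources are close (roughly 15.3 USD per lead for posts versus 13.3 USD for telegram), which suggests that neither conversion nor deposit size alone should drive the decision.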

