# <center>Test assignment<br>Position: Product Analyst</center>

<img src="http://vignette1.wikia.nocookie.net/cuttherope/images/c/c0/Zeptolab_logo.png/revision/latest?cb=20130529115328">

---

To complete the task, calculate the LTV of each user cohort for all available days, then submit the notebook with your calculations together with a final CSV file. Please comment your code and make sure it is reproducible.

- **Cohort** means all players who installed the game on the same day. Cohort numbering in this task starts at 3001 (players who installed on January 1st, 2017).
- **LTV** includes two sources: revenue from in-app purchases (IAP) and revenue from advertisements (Ads). Ads revenue for a given day is the total number of ads watched on that day multiplied by the corresponding eCPM and divided by 1000.
- **LTV** calculation formula:

$LTV_{\text{day N}} = \sum\limits_{i=1}^{N}{\frac{\text{Ads Revenue}_{\text{day i}} + \text{IAP Revenue}_{\text{day i}}}{\text{Cohort installs}}}$
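As a minimal sketch of the formula, the snippet below works through a single hypothetical cohort. The numbers and the column names (`ads_watched`, `ecpm`, `iap_revenue`) are made up for illustration and do not come from the assignment data.

```python
import pandas as pd

# Toy numbers for one hypothetical cohort (not taken from the assignment files)
installs = 1000
daily = pd.DataFrame({
    "day": [1, 2, 3],                  # days since install, day 1 = install day
    "ads_watched": [2000, 1000, 500],
    "ecpm": [5.0, 4.0, 4.0],
    "iap_revenue": [30.0, 10.0, 0.0],
})

# Ads revenue for a day = ads watched * eCPM / 1000
daily["ads_revenue"] = daily["ads_watched"] * daily["ecpm"] / 1000

# Revenue per install earned on each individual day
daily["revenue_per_install"] = (daily["ads_revenue"] + daily["iap_revenue"]) / installs

# LTV on day N is the running total over days 1..N
daily["ltv"] = daily["revenue_per_install"].cumsum()
print(daily[["day", "ltv"]])
```

The key point is the running total: each $LTV_{\text{day N}}$ adds the revenue per install of day N to the value for day N-1, so LTV never decreases as N grows.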

Output file format:

Cohort_number  | $LTV_1$ | $LTV_2$ | $LTV_3$ | ... | $LTV_{N-1}$ | $LTV_N$
-------------- | ------- | ------- | ------- |-----| ----------- | ------
3001           |   0.01  |  0.015  |  0.04   | ... |     0.25    |  0.28
3002           |   0.03  |  0.04   |  0.12   | ... |     1.28    |  NaN
...            |   ...   |   ...   |   ...   | ... |      ...    |  ...
3031           |   0.02  |  0.07   |   NaN   | ... |      ...    |  NaN
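One possible way to produce this layout is sketched below, assuming the data is already in long form with one row per cohort and day since install. The frame `long_ltv` and its values are hypothetical. Days that a cohort has not reached yet are simply absent from the input, so they stay `NaN` in the wide table.

```python
import pandas as pd

# Hypothetical long-form input: one row per (cohort, day since install)
long_ltv = pd.DataFrame({
    "cohort": [3001, 3001, 3001, 3002, 3002, 3003],
    "day": [1, 2, 3, 1, 2, 1],
    "revenue_per_install": [0.01, 0.005, 0.025, 0.03, 0.01, 0.02],
})

# Running total within each cohort gives LTV by day
long_ltv = long_ltv.sort_values(["cohort", "day"])
long_ltv["ltv"] = long_ltv.groupby("cohort")["revenue_per_install"].cumsum()

# One row per cohort, one column per day; unobserved days remain NaN
wide = long_ltv.pivot(index="cohort", columns="day", values="ltv")
wide.columns = [f"LTV_{d}" for d in wide.columns]
wide = wide.reset_index().rename(columns={"cohort": "Cohort_number"})
print(wide)
```

Leaving the gaps as `NaN` rather than filling them keeps "no data yet" distinguishable from "zero revenue", which is what the example rows above show for the later cohorts.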

In [269]:
import pandas as pd

# Load data

# 'installs.xlsx': contains data about when users installed the game
installs = pd.read_excel('installs.xlsx').rename(columns={'cohort_number': 'cohort'})

# 'total_ads_watched.txt': the total number of ads watched by each cohort on each day
total_ads_watched = pd.read_csv('total_ads_watched.txt', sep='\t')
# Reshape from wide (one column per cohort) to long (one row per Date and cohort)
total_ads_watched = pd.melt(total_ads_watched, id_vars=['Date'], var_name='cohort', value_name='total_ads_watched_value')

# 'eCPM.txt': contains the effective cost per thousand impressions (eCPM) for ads on each day
eCPM = pd.read_csv('eCPM.txt', sep='\t')

# 'in_game_purchases_revenue.csv': records revenue from in-game purchases for each cohort on each day
in_game_purchases = pd.read_csv('in_game_purchases_revenue.csv')
# Same wide-to-long reshape as for the ads data
in_game_purchases = pd.melt(in_game_purchases, id_vars=['Date'], var_name='cohort', value_name='in_game_purchases_value')

# Convert dates to datetime format
installs.Date = pd.to_datetime(installs.Date)
total_ads_watched.Date = pd.to_datetime(total_ads_watched.Date)
eCPM.Date = pd.to_datetime(eCPM.Date)
in_game_purchases.Date = pd.to_datetime(in_game_purchases.Date)

# Convert the cohort columns to a common type for merging
installs.cohort = installs.cohort.astype(int)
total_ads_watched.cohort = total_ads_watched.cohort.astype(int)



In [273]:
# Merge dataframes
merged_df = installs.merge(total_ads_watched, on=['cohort', 'Date'], how='left')
merged_df = merged_df.merge(eCPM, on='Date', how='left')

in_game_purchases.cohort = in_game_purchases.cohort.astype(int)
merged_df = merged_df.merge(in_game_purchases, on=['cohort', 'Date'], how='left')
merged_df.head()

Unnamed: 0,Date,cohort,installs,total_ads_watched_value,eCPM,in_game_purchases_value
0,2017-01-01,3001,62078,151902,5.879974,3624.0337
1,2017-01-02,3002,62601,143598,5.259028,3785.3424
2,2017-01-03,3003,56958,126034,4.790151,3372.0197
3,2017-01-04,3004,54959,116066,4.701444,2702.2086
4,2017-01-05,3005,55273,116459,4.802823,4367.9685


In [271]:
# Calculate LTV
df = merged_df.copy()

# Ads revenue = ads watched * eCPM / 1000
df['ads_revenue'] = df.total_ads_watched_value * df.eCPM / 1000
# Revenue per install for each row, rounded to 2 decimals
df['ltv'] = ((df.ads_revenue + df.in_game_purchases_value) / df.installs).round(2)
df

Unnamed: 0,Date,cohort,installs,total_ads_watched_value,eCPM,in_game_purchases_value,ads_revenue,ltv
0,2017-01-01,3001,62078,151902,5.879974,3624.0337,893.179766,0.07
1,2017-01-02,3002,62601,143598,5.259028,3785.3424,755.185933,0.07
2,2017-01-03,3003,56958,126034,4.790151,3372.0197,603.721891,0.07
3,2017-01-04,3004,54959,116066,4.701444,2702.2086,545.677857,0.06
4,2017-01-05,3005,55273,116459,4.802823,4367.9685,559.332005,0.09
5,2017-01-06,3006,59910,132946,4.524031,2777.089,601.451891,0.06
6,2017-01-07,3007,59922,137041,5.651871,5088.1495,774.538092,0.1
7,2017-01-08,3008,55167,119303,5.563165,4039.3633,663.702248,0.09
8,2017-01-09,3009,46065,82906,3.928431,2226.672,325.690476,0.06
9,2017-01-10,3010,45813,85636,3.776362,2298.2953,323.392573,0.06


In [272]:
# Pivot to one row per cohort and one column per calendar date
pivot_df = df.pivot(index='cohort', columns='Date', values='ltv')

# Rename the date columns to LTV1, LTV2, ..., LTVn (in calendar-date order)
pivot_df.columns = [f'LTV{idx + 1}' for idx, col in enumerate(pivot_df.columns)]

# Optional: replace missing cells with 0.0
pivot_df = pivot_df.fillna(0.0)

# Reset the index to include the cohort number as a column
pivot_df = pivot_df.reset_index()
pivot_df

Unnamed: 0,cohort,LTV1,LTV2,LTV3,LTV4,LTV5,LTV6,LTV7,LTV8,LTV9,...,LTV22,LTV23,LTV24,LTV25,LTV26,LTV27,LTV28,LTV29,LTV30,LTV31
0,3001,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3002,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3003,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3004,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3005,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,3006,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,3007,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,3008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,3009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,3010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Many cohorts have low or zero LTVs in the early columns (LTV1, LTV2, etc.). This could suggest that new installs do not generate significant revenue immediately, which may be common in apps where monetization builds over time through engagement.

Alternatively, it could indicate missing data, or that these cohorts had not yet reached those later stages at the time of this data extraction. Note that the zeros in this table were written by `fillna(0.0)`, so a true zero cannot be told apart from a missing value here.
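One way to probe the second explanation is to rebuild the table by day since install and leave gaps unfilled. In the notebook, the merges start from `installs` and join on both `cohort` and `Date`, so only the install-day row of each cohort is kept. The sketch below is an alternative under stated assumptions, not the notebook's method: it assumes the ads and purchases files contain rows for days after install, uses made-up toy frames that mimic the notebook's column names, and introduces the helper names `daily`, `install_date`, `day` and `ltv_table` purely for illustration.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (all values are made up)
installs = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "cohort": [3001, 3002],
    "installs": [1000, 2000],
})
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-02"]),
    "cohort": [3001, 3001, 3002],
    "total_ads_watched_value": [2000, 1000, 4000],
    "in_game_purchases_value": [30.0, 10.0, 50.0],
})
eCPM = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "eCPM": [5.0, 4.0],
})

# Attach eCPM by calendar date, then install info by cohort only,
# so every daily row of a cohort is kept
df = daily.merge(eCPM, on="Date", how="left")
df = df.merge(installs.rename(columns={"Date": "install_date"}), on="cohort", how="left")

# Day since install, with day 1 being the install date itself
df["day"] = (df["Date"] - df["install_date"]).dt.days + 1

# Revenue per install on each day, then a running total per cohort
ads_revenue = df["total_ads_watched_value"] * df["eCPM"] / 1000
df["revenue_per_install"] = (ads_revenue + df["in_game_purchases_value"]) / df["installs"]
df = df.sort_values(["cohort", "day"])
df["ltv"] = df.groupby("cohort")["revenue_per_install"].cumsum()

# Wide table by day since install; days not yet reached stay NaN
ltv_table = df.pivot(index="cohort", columns="day", values="ltv")
ltv_table.columns = [f"LTV{d}" for d in ltv_table.columns]
print(ltv_table)
```

With this layout, `LTV1` always refers to a cohort's own first day rather than to January 1st, and any remaining `NaN` marks a day the cohort has not reached, which makes the two explanations above easier to separate.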

In [267]:
pivot_df.to_csv('final.csv', index=False)