In [1]:
import pandas as pd

# Problem Description

Find the 3-month rolling average of total revenue from purchases given a table with users, their purchase amount, and date purchased. Do not include returns which are represented by negative purchase values. Output the year-month (YYYY-MM) and 3-month rolling average of revenue, sorted from earliest month to latest month.

A 3-month rolling average is defined by calculating the average total revenue from all user purchases for the current month and previous two months. The first two months will not be a true 3-month rolling average since we are not given data from last year. Assume each month has at least one purchase.
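
To make the definition concrete, here is a minimal plain-Python sketch. The function name `rolling_average` and the sample totals are hypothetical and only illustrate the arithmetic; they are not taken from the dataset.

```python
def rolling_average(monthly_totals, window=3):
    """Average each month's total with up to `window - 1` preceding months."""
    averages = []
    for i in range(len(monthly_totals)):
        # At the start of the series fewer than `window` months exist,
        # so the slice simply begins at index 0.
        start = max(i - window + 1, 0)
        chunk = monthly_totals[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical monthly revenue totals (illustrative only)
totals = [300, 600, 900, 1200]
print(rolling_average(totals))  # [300.0, 450.0, 600.0, 900.0]
```

The first value averages one month, the second averages two, and from the third month on every value is a true 3-month average.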




## First Look at the Data

In [2]:
amazon_purchases = pd.read_csv('amazon_purchases.csv')
amazon_purchases.head(3)

Unnamed: 0,user_id,created_at,purchase_amt
0,10,2020-01-01,3742
1,11,2020-01-04,1290
2,12,2020-01-07,4249


## First Thoughts

* Discard negative values (returns)

* Create a year-month column

* Group by year-month and sum the purchase amounts

* Apply the 3-month rolling average formula


## Data Analysis

In [3]:
#checking for missing values and format of columns
print(amazon_purchases.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       100 non-null    int64 
 1   created_at    100 non-null    object
 2   purchase_amt  100 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 2.5+ KB
None


There are no missing values, but the date column `created_at` is stored as `object` (plain strings) rather than as a datetime type.

created_at -> pd.to_datetime

In [5]:
amazon_purchases.created_at = pd.to_datetime(amazon_purchases.created_at, format='%Y-%m-%d')
amazon_purchases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   user_id       100 non-null    int64         
 1   created_at    100 non-null    datetime64[ns]
 2   purchase_amt  100 non-null    int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 2.5 KB


Now that the column types are fixed, let's start.

## Solution

In [7]:
import numpy as np
df = amazon_purchases.copy()

#discard returns (negative values); .copy() avoids a SettingWithCopyWarning when adding a column below
df = df[df['purchase_amt'] > 0].copy()

#creating and groupby year-month
df['year_month'] = df.created_at.dt.strftime('%Y-%m')
gby = df.groupby('year_month').purchase_amt.sum().sort_index().reset_index()

#apply formula: average the current month and up to two previous months
for i in range(gby.shape[0]):
    start = max(i - 2, 0) #the first 2 months have fewer than 3 months of history
    gby.loc[i,'rolling3_avg'] = np.mean(gby.purchase_amt.values[start:i+1])
output = gby[['year_month', 'rolling3_avg']]
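
As an alternative to the explicit loop, pandas has a built-in `rolling` method. The sketch below is not the solution used in this notebook; it assumes a small hypothetical DataFrame named `toy` with the same column names, and relies on `min_periods=1` so that the first two months are averaged over the months available.

```python
import pandas as pd

# Hypothetical toy data (not the contents of amazon_purchases.csv)
toy = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-01-25",
        "2020-02-10",
        "2020-03-15", "2020-03-18",
        "2020-04-02",
    ]),
    "purchase_amt": [100, -40, 200, 600, 900, -100, 1200],
})

# Drop returns, then total the remaining purchases per year-month
purchases = toy[toy["purchase_amt"] > 0]
monthly = (
    purchases
    .assign(year_month=purchases["created_at"].dt.strftime("%Y-%m"))
    .groupby("year_month")["purchase_amt"]
    .sum()
    .sort_index()
)

# window=3 covers the current month plus the two before it;
# min_periods=1 lets the first two months use whatever history exists
rolling3 = monthly.rolling(window=3, min_periods=1).mean()
result = rolling3.rename("rolling3_avg").reset_index()
print(result)
```

With these toy values the monthly totals are 300, 600, 900 and 1200, so the rolling averages come out as 300, 450, 600 and 900. Note that `rolling` counts rows, not calendar months, so this approach depends on the stated assumption that every month has at least one purchase.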

## Final Output

In [9]:
output.round(2)

Unnamed: 0,year_month,rolling3_avg
0,2020-01,26292.0
1,2020-02,23493.5
2,2020-03,25535.67
3,2020-04,24082.67
4,2020-05,25417.67
5,2020-06,24773.33
6,2020-07,25898.67
7,2020-08,25497.33
8,2020-09,24544.0
9,2020-10,21211.0
