# Project: Product Management From Scratch

The boutique ice cream sandwich startup you've joined, Snackr, has gotten by on the personality of its founder, I. Sandwich (real name?), for too long.  The features the team has launched to the website over the last two quarters have not landed with customers.

You've decided it's time to step up and lay out a plan for identifying paths forward that are more likely to succeed than "whatever I. Sandwich happened to promise a random person he met at a bar last night".  You pitch the team on an iterative design process that brings qualitative and quantitative techniques together to minimize industry-standard risk types so that you can move forward confidently.

Someone in your company recently learned about A/B experiments, so they instrumented tracking and ran an experiment based on your founder's unfounded random idea.

Unsurprisingly, the experiment flopped.

But there's a bright side! 

You now have some user event logs you can dig through to surface user personas and a user journey.  You tell the team you'd like to collect all of the known assumptions before you launch the next experiment to the site, so you have the best chance of success.  You tell the CEO and Board (wait, is it really just his friends from college and his mom?) that at the end of your week-long design sprint you're going to launch an experiment.

Your goal for this project is to use the data to identify actionable hypotheses that your team can act on, given the types of risk identified.  You want to be able to show the largest possible impact in the shortest possible time.



## NOTES
Along the way you'll develop:
 - Actionable user personas & stages of a user journey
 - Hypotheses based on the data
 
Desired actions to identify:
 - User Interviews
 - Usability Testing
 - A/B Experiments


<!--
%%ulab_page_divider
--><hr/>

## Section 1 - Evaluate previous A/B experiment [STUDENT]

One of your engineers helpfully exported all of the events from the previous A/B experiment for you.  

She sent them over to you broken down by country and explained how to put them all into one dataframe.

1: Ensure you got data  
2: Check whether the previous A/B experiment was valid (95% confidence)  
3: Check whether the experiment or control won by performing a chi-square test on outcomes and reporting whether the difference is significant.


In [43]:
import numpy as np
import pandas as pd
import glob
import datetime
import itertools

log_files = glob.glob(r'event_logs_part_*.csv')
df = pd.concat((pd.read_csv(f) for f in log_files)).drop(['Unnamed: 0'],axis=1)

## Your Work Goes Here

#1: 

#2: 

#3: 

## Section 1 - Evaluate previous A/B experiment [SOLUTION]

One of your engineers helpfully exported all of the events from the previous A/B experiment for you.  

She sent them over to you broken down by country and explained how to put them all into one dataframe.

1: Ensure you got data  
2: Check whether the previous A/B experiment was valid (95% confidence)  
3: Check whether the experiment or control won by performing a chi-square test on outcomes and reporting whether the difference is significant.

In [42]:
import numpy as np
from scipy import stats
import pandas as pd
import glob
import datetime
import itertools

log_files = glob.glob(r'event_logs_part_*.csv')
df = pd.concat((pd.read_csv(f) for f in log_files)).drop(['Unnamed: 0'],axis=1)

In [12]:
#S1_1
df.describe()

Unnamed: 0,event_uuid,user_uuid,event_time,device_type,session_uuid,experiment_group,user_country,event_page,event_type
count,1250951,1250951,1250951,1250951,1250951,1250951,1250951,1250951,1250951
unique,1250951,81644,526684,4,227721,2,5,4,5
top,a3a49603-35e5-4c16-b5c5-c1ffbcf3115e,51ca13c8-7deb-48de-9569-f73e1fcffdf9,2019-10-03 13:12:54,mobile_web,0a2fe7b0-6a44-4d9c-96d7-24e09f560a96,control,US,item_page,view
freq,1,48,12,624179,10,627809,877413,500092,821166


In [13]:
#S1_2
## Check for duplicate events
df[df.event_uuid.duplicated()]
#df = df[~df.event_uuid.duplicated()]

Unnamed: 0,event_uuid,user_uuid,event_time,device_type,session_uuid,experiment_group,user_country,event_page,event_type


In [14]:
## Get the number of sessions by experiment or control group
df.groupby('experiment_group').apply(lambda x: x['session_uuid'].nunique()).reset_index(name='count')

Unnamed: 0,experiment_group,count
0,control,114245
1,experiment,113476


In [17]:
## Ensure the experiment was split evenly between sessions

# Unique sessions per group, computed once and reused
sessions_by_group = df.groupby('experiment_group')['session_uuid'].nunique()
control_sessions = sessions_by_group['control']
experiment_sessions = sessions_by_group['experiment']


desired_p = 0.5
z_score = 1.960  # two-sided z-score for a 95% confidence level (TODO: student to derive this)

total_control = 22725  # TODO: student to get the total number of control sessions
total_experiment = 22879  # TODO: student to get the total number of experiment sessions


# Standard error of the control share under the expected 50/50 split
standard_error = np.sqrt(desired_p * (1 - desired_p) / (total_control + total_experiment))
margin_error = standard_error * z_score

confidence_interval = (desired_p - margin_error, desired_p + margin_error)

p_hat = total_control / (total_control + total_experiment)

print("p_hat:\t\t", p_hat)
print("confidence int:\t", confidence_interval)
print("Continue analysis?:", confidence_interval[0] < p_hat < confidence_interval[1])

p_hat:		 0.4983115516182791
confidence int:	 (0.49541093079380116, 0.50458906920619884)
Continue analysis?: True
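
The split check in the cell above can also be wrapped in a small helper so that it can be rerun on any pair of session counts. This is a minimal sketch rather than the reference solution: the function name `split_is_balanced` and its parameters are hypothetical, and it uses the same normal approximation around an expected share of 0.5.

```python
import math

def split_is_balanced(control_n, experiment_n, expected_share=0.5, z_score=1.96):
    """Return (p_hat, interval, ok) for a session-split sanity check.

    Hypothetical helper: mirrors the normal-approximation calculation
    in the cell above.
    """
    total = control_n + experiment_n
    standard_error = math.sqrt(expected_share * (1 - expected_share) / total)
    margin = z_score * standard_error
    interval = (expected_share - margin, expected_share + margin)
    p_hat = control_n / total
    return p_hat, interval, interval[0] < p_hat < interval[1]

# Unique session counts per group, as printed by the groupby cell earlier in this section.
p_hat, interval, ok = split_is_balanced(114245, 113476)
print(f"p_hat={p_hat:.4f} interval=({interval[0]:.4f}, {interval[1]:.4f}) balanced={ok}")
```

Passing counts derived from the dataframe, rather than hand-typed totals, keeps the check tied to the data that was actually loaded.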


In [18]:
## Get the number of purchases by experiment or control group
df.groupby('experiment_group')['event_type'].apply(lambda x: (x == 'purchase').sum()).reset_index(name='count')


Unnamed: 0,experiment_group,count
0,control,2410
1,experiment,2497


In [19]:
# Purchase events per group, computed once and reused
purchases_by_group = df.groupby('experiment_group')['event_type'].apply(lambda x: (x == 'purchase').sum())
control_purchase = purchases_by_group['control']
experiment_purchase = purchases_by_group['experiment']


In [20]:
#S1_3
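# Chi-squared test of independence on purchase vs. no-purchase outcomes

# The p-value here is greater than .05 (our threshold for significance), so this result could be due to randomness
# Each row below is [purchases, sessions minus purchases] for one group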

control_stats = [control_purchase, control_sessions - control_purchase]
experiment_stats = [experiment_purchase, experiment_sessions - experiment_purchase]
outcomes =  np.array([control_stats, experiment_stats])
stats.chi2_contingency(outcomes)

(2.191268739121107,
 0.13879494679645424,
 1,
 array([[   2461.78532063,  111783.21467937],
        [   2445.21467937,  111030.78532063]]))
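
To see where the statistic and p-value come from, the same 2x2 test can be worked by hand. This is a sketch under two assumptions: that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, and that with one degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`. The function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table, with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            # Yates' correction shrinks |observed - expected| by 0.5 (never below 0).
            diff = max(abs(observed - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# [purchases, sessions minus purchases], using the counts printed in the cells above.
control = [2410, 114245 - 2410]
experiment = [2497, 113476 - 2497]
chi2, p_value = chi_square_2x2([control, experiment])
print(f"chi2={chi2:.4f} p={p_value:.4f} significant={p_value < 0.05}")
```

The conversion rates behind the table are about 2.1% for control (2410 / 114245) and 2.2% for experiment (2497 / 113476), a gap small enough to be explained by chance at the 95% level.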

### Section 1 - Solution summary

1: The experiment ran.  
2: Event data came in and sessions were bucketed evenly between control and experiment.  
3: But there was no significant change in conversion between the experiment and control groups (p ≈ 0.14).

## Section 2 - Develop Actionable Personas and Journey Steps [STUDENT]

Now you're going to develop actionable user personas and stages of a user journey.

1: Find the actionable attributes of your users upon which to build your personas  
2: Find the different stages of the user's journey  
3: Identify risks or opportunities for your team to address with hypotheses  



In [104]:
## Your work goes here

## Section 2 - Develop Actionable Personas and Journey Steps [SOLUTION]

In [17]:
#S1_3
#TODO - MATT - Generate opportunities in dataset.
#1 - bounce
#2 - bug
#3 - typo
#4 - l10n
#5 - price
#6 - flow

In [35]:
#S2_1
# Going back to the data, the student should be able to identify the ways the users differ.
# In this case the only two ways are device type and user country.
df.describe()

Unnamed: 0,event_uuid,user_uuid,event_time,device_type,session_uuid,experiment_group,user_country,event_page,event_type
count,1250951,1250951,1250951,1250951,1250951,1250951,1250951,1250951,1250951
unique,1250951,81644,526684,4,227721,2,5,4,5
top,a3a49603-35e5-4c16-b5c5-c1ffbcf3115e,51ca13c8-7deb-48de-9569-f73e1fcffdf9,2019-10-03 13:12:54,mobile_web,0a2fe7b0-6a44-4d9c-96d7-24e09f560a96,control,US,item_page,view
freq,1,48,12,624179,10,627809,877413,500092,821166


In [37]:
#S2_1
# One possible way students might get the values
# This is the minimal effort. This student will have the types,
# but not have the quantities of users in each group.
devices = df.device_type.unique()
countries = df.user_country.unique()

In [38]:
#S2_1
# This and the following code samples are examples which are 
# slightly better than the first.
# Now the student might see which country or which device type is most important, 
# but not have a quantitative reason for the combination of the two.

# Said another way, they may jump to the conclusion that "US" and "Mobile Web" is the right answer.
# Simpson's paradox could trick them here: large independent totals do not guarantee
# that the combination of the two is also large.


df.groupby('user_country')['user_uuid'].nunique().reset_index(name='count')

Unnamed: 0,user_country,count
0,CN,1608
1,FR,4178
2,JP,2407
3,UK,16222
4,US,57229


In [39]:
#S2_1
# Continuation of last example
df.groupby('device_type')['user_uuid'].nunique().reset_index(name='count')

Unnamed: 0,device_type,count
0,android,28630
1,desktop_web,20177
2,ios,42285
3,mobile_web,64357


In [31]:
# S2_1 
# In this example the student has moved on to looking for the most impactful area.
# They've managed to count the number of unique users at the device / country level.

# If they use this method and then manually read off the numbers, that's the hard way to do it,
# but it still gives the correct answer
#list(itertools.product(devices,countries))

#df['Rank'] = (df['device_type'] + df['user_country']).astype(int).rank(method)
df.groupby(['device_type', 'user_country'])['user_uuid'].nunique()

device_type  user_country
android      CN                545
             FR               1471
             JP                812
             UK               5603
             US              20199
desktop_web  CN                392
             FR               1034
             JP                625
             UK               3940
             US              14186
ios          CN                819
             FR               2156
             JP               1221
             UK               8399
             US              29690
mobile_web   CN               1300
             FR               3317
             JP               1903
             UK              12713
             US              45124
Name: user_uuid, dtype: int64

In [46]:
#S2_2

# This student has made their life really easy.
# They've not only found the number of unique users per device/country combination, 
# but they've also rank-ordered them, so they know where to start
df.groupby(['device_type', 'user_country'])['user_uuid'].nunique().rank(method='dense')

device_type  user_country
android      CN               2.0
             FR               9.0
             JP               4.0
             UK              14.0
             US              18.0
desktop_web  CN               1.0
             FR               6.0
             JP               3.0
             UK              13.0
             US              17.0
ios          CN               5.0
             FR              11.0
             JP               7.0
             UK              15.0
             US              19.0
mobile_web   CN               8.0
             FR              12.0
             JP              10.0
             UK              16.0
             US              20.0
Name: user_uuid, dtype: float64
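
Note that `rank(method='dense')` ranks in ascending order by default, so the largest segment receives the highest rank (20.0 above). Below is a minimal sketch of two alternatives that put the largest segment first. It uses only a handful of the unique-user counts copied from the grouped output above, and the variable names are illustrative.

```python
import pandas as pd

# A few unique-user counts copied from the grouped output above (not the full table).
segments = [
    ("android", "US", 20199),
    ("desktop_web", "US", 14186),
    ("ios", "UK", 8399),
    ("ios", "US", 29690),
    ("mobile_web", "UK", 12713),
    ("mobile_web", "US", 45124),
]
index = pd.MultiIndex.from_tuples(
    [(device, country) for device, country, _ in segments],
    names=["device_type", "user_country"],
)
users = pd.Series([n for _, _, n in segments], index=index, name="user_uuid")

# Option 1: rank in descending order so the largest segment gets rank 1.
ranks = users.rank(method="dense", ascending=False)

# Option 2: sort the counts directly and keep the top few segments.
largest_first = users.sort_values(ascending=False)
print(largest_first.head(3))
```

Sorting keeps the raw counts visible next to the ordering, which makes it easier to judge how much larger one segment is than the next.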

In [47]:
# S2_2
# One possible way students might get the values
# This is the minimal effort. This student will have the types of pages or events,
# but not have the quantities of users in each group.

pages = df.event_page.unique()

In [48]:
events = df.event_type.unique()

In [53]:
# This and the following code samples are examples which are 
# slightly better than the first.

# Now the student might see which page or which event type is most important, 
# but not have a quantitative reason for the combination of the two.

# Said another way, they may jump to the conclusion that "Item Page" and "View" is the right answer.

# Simpson's paradox could trick them here: large independent totals do not guarantee
# that the combination of the two is also large.
df.groupby('event_page')['event_uuid'].nunique().reset_index(name='count')


Unnamed: 0,event_page,count
0,home_page,125106
1,item_page,500092
2,marketing_page,375463
3,search_page,250290


In [54]:


df.groupby('event_type')['event_uuid'].nunique().reset_index(name='count')




Unnamed: 0,event_type,count
0,cart,36090
1,favorite,194562
2,purchase,4907
3,share,194226
4,view,821166


In [57]:
df.groupby(['event_page', 'event_type'])['session_uuid'].nunique()
# In this example the student has moved on to looking for the most impactful area.
# They've managed to count the number of sessions at the event page / event type level.

# If they use this method and then manually read off the numbers, that's the hard way to do it,
# but it still gives the correct answer
#list(itertools.product(pages,events))


event_page      event_type
home_page       cart            3411
                favorite       17802
                purchase         454
                share          17581
                view           64873
item_page       cart           11047
                favorite       54499
                purchase        1548
                share          54622
                view          151485
marketing_page  cart            8858
                favorite       44463
                purchase        1246
                share          44563
                view          132680
search_page     cart            6194
                favorite       32657
                purchase         891
                share          32395
                view          105821
Name: session_uuid, dtype: int64
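
One way to turn these session counts into journey stages is to compare, for each page, how many sessions recorded a cart or purchase event relative to sessions with a view. The sketch below hard-codes the counts printed above. The `funnel` dictionary and the view-relative ratio are illustrative assumptions, and because a session can touch several pages the ratios are a rough signal rather than a true conversion rate.

```python
# Sessions per (page, event type), copied from the grouped output above.
funnel = {
    "home_page": {"view": 64873, "cart": 3411, "purchase": 454},
    "item_page": {"view": 151485, "cart": 11047, "purchase": 1548},
    "marketing_page": {"view": 132680, "cart": 8858, "purchase": 1246},
    "search_page": {"view": 105821, "cart": 6194, "purchase": 891},
}

# Sessions with a cart or purchase event on a page, relative to sessions with a view there.
rates = {
    page: {step: counts[step] / counts["view"] for step in ("cart", "purchase")}
    for page, counts in funnel.items()
}

# Print pages from the highest to the lowest purchase-to-view ratio.
for page, r in sorted(rates.items(), key=lambda kv: kv[1]["purchase"], reverse=True):
    print(f"{page:15s} cart/view={r['cart']:.2%} purchase/view={r['purchase']:.2%}")
```

A page with many views but a low purchase ratio is a candidate drop-off point to probe with user interviews or usability testing before committing to another A/B experiment.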

In [59]:
# This student has made their life really easy.
# They've not only found the number of sessions per page/event type combination, 
# but they've also rank-ordered them, so they know where to start
df.groupby(['event_page', 'event_type'])['session_uuid'].nunique().rank(method='dense')

event_page      event_type
home_page       cart           5.0
                favorite      10.0
                purchase       1.0
                share          9.0
                view          17.0
item_page       cart           8.0
                favorite      15.0
                purchase       4.0
                share         16.0
                view          20.0
marketing_page  cart           7.0
                favorite      13.0
                purchase       3.0
                share         14.0
                view          19.0
search_page     cart           6.0
                favorite      12.0
                purchase       2.0
                share         11.0
                view          18.0
Name: session_uuid, dtype: float64

In [49]:
# If they use this method and then manually get the numbers that's a hard way to do it,
# but it's still the correct answer

list(itertools.product(pages, events))

[('item_page', 'view'),
 ('item_page', 'favorite'),
 ('item_page', 'share'),
 ('item_page', 'cart'),
 ('item_page', 'purchase'),
 ('marketing_page', 'view'),
 ('marketing_page', 'favorite'),
 ('marketing_page', 'share'),
 ('marketing_page', 'cart'),
 ('marketing_page', 'purchase'),
 ('home_page', 'view'),
 ('home_page', 'favorite'),
 ('home_page', 'share'),
 ('home_page', 'cart'),
 ('home_page', 'purchase'),
 ('search_page', 'view'),
 ('search_page', 'favorite'),
 ('search_page', 'share'),
 ('search_page', 'cart'),
 ('search_page', 'purchase')]

### Section 2 - Solution Summary

1) The student has identified device type and country as the actionable elements of their personas as a first step.  

2) The student has identified that the item page is getting the most traffic and is the best opportunity to show traction in the short term.



## Section 3 - Generate Hypotheses and Recommended Next Step [STUDENT]

Now you're going to generate hypotheses based on your personas, journey, and risks & opportunities.

## Section 3 - Generate Hypotheses and Recommended Next Step [SOLUTION]

Title | Evidence | Product Change | Audience | Magnitude | Metric | Time
--- | --- | --- | --- | --- | --- | ---
Get Dinner | I'm starving | Eat food | Me | 100% | Less Hungry | 15 minutes








