## Data Challenge 2 - Cody Schank

### Assignment: 
Company XYZ sells software for \$39. Since revenue has been flat for some time, the VP of Product has decided to run a test increasing the price. She hopes that this will increase revenue. In the experiment, 66% of users saw the old price (\$39), while a random sample of 33% of users saw a higher price (\$59).

### This is a classic A/B test, but we should also consider user segmentation as much as possible.

I analyzed the results of the pricing test to determine whether the company should increase the price of its software from \$39 to \$59. A simple approach that ignores customer segmentation shows that the price should be increased across the board, resulting in a revenue increase of about 18% (question 1).

I also found that Mac and iOS users have higher conversion rates for the software, while friend referrals are (unsurprisingly) the highest-converting source of new customers. Thinking only about increasing conversion rates, the company should direct its ads to Mac and iOS users and encourage users to refer their friends (question 2). Two sources, Bing and Yahoo search engine results, earn more revenue at the original price (\$39). The company should consider keeping the price at \$39 for users who find the software through these sources, though other users would likely switch to those sources once the workaround is discovered.

### First, let's load the necessary libraries and read in the data, and do some data clean up (these steps were determined necessary by some initial exploratory analysis of the data).

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
test_results = pd.read_csv("./data/test_results.csv")
user_table = pd.read_csv("./data/user_table.csv")

In [3]:
test_results = test_results.drop(columns='Unnamed: 0')
test_results.loc[test_results.test == 0, 'price'] = 39.0
test_results.loc[test_results.test == 1, 'price'] = 59.0
test_results.loc[test_results.operative_system == 'anroid', 'operative_system'] = 'android'
test_results.loc[test_results.operative_system == 'i0S', 'operative_system'] = 'iOS'
test_results.loc[test_results.operative_system == 'wnddows', 'operative_system'] = 'windows'
test_results.loc[test_results.operative_system == 'apple', 'operative_system'] = 'mac'

### Now, let's ask the first question. Does it make sense to increase the price of the software across the board? To answer this, I grouped by test, calculated the conversion rate, and multiplied that rate by the price.

In [6]:
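# numeric_only=True skips the string columns, which recent pandas versions refuse to average
segmented_results = test_results.groupby(['test']).mean(numeric_only=True)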
segmented_results['revenue'] = segmented_results['converted']*segmented_results['price']
segmented_results = segmented_results.sort_values('revenue', ascending = False)
segmented_results

| test | user_id | price | converted | revenue |
|---|---|---|---|---|
| 1 | 499483.228643 | 59.0 | 0.015543 | 0.917018 |
| 0 | 499167.741613 | 39.0 | 0.019904 | 0.776241 |


### According to this quick analysis, disregarding any user segmentation, we should sell the software for \$59. Assuming a population of 10,000 potential customers, about 155 customers would buy the software at the \$59 price, netting about \$9,170, while about 199 customers would buy it at the \$39 price, netting about \$7,762. Setting the price at \$59 would therefore net roughly an additional \$1,408, an increase of about 18% in revenue.
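The arithmetic above can be reproduced in a few lines, and the same numbers allow a rough check of whether the revenue gap is larger than sampling noise. This is a minimal sketch rather than part of the original analysis: the group sizes are assumed to be the web plus mobile counts per test group from the device breakdown later in the notebook, and the z-statistic treats revenue per visitor as price times a Bernoulli outcome.

```python
from math import sqrt
from statistics import NormalDist

# Prices, conversion rates, and revenue per visitor from the grouped table.
price_old, rate_old, rev_old = 39.0, 0.019904, 0.776241
price_new, rate_new, rev_new = 59.0, 0.015543, 0.917018

# Assumption: group sizes are the web + mobile counts per test group
# (82246 + 120481 for test=0 and 48083 + 65990 for test=1).
n_old, n_new = 202727, 114073

# Revenue per 10,000 visitors at each price, and the relative lift.
visitors = 10_000
revenue_old = visitors * rev_old
revenue_new = visitors * rev_new
lift = revenue_new / revenue_old - 1

# Revenue per visitor is price * Bernoulli(rate), so the variance of its
# sample mean is price**2 * rate * (1 - rate) / n.
var_old = price_old**2 * rate_old * (1 - rate_old) / n_old
var_new = price_new**2 * rate_new * (1 - rate_new) / n_new
z = (rev_new - rev_old) / sqrt(var_old + var_new)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"revenue at $39: {revenue_old:.0f}, at $59: {revenue_new:.0f}, lift: {lift:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

Under these assumptions the z-statistic is well above the usual 1.96 threshold, which suggests the overall revenue difference is unlikely to be a sampling artifact.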

### Next, let's segment the users by the available features. Below we see that Mac and iOS users convert at the highest rates. The price increase still makes sense when we break it down by operating system (i.e. the \$59 price yields higher revenue per user on every operating system).

In [7]:
segmented_results = test_results[['operative_system','converted']].groupby(['operative_system']).mean()
segmented_results = segmented_results.sort_values('converted', ascending = False)
segmented_results

| operative_system | converted |
|---|---|
| mac | 0.023998 |
| iOS | 0.022322 |
| windows | 0.016955 |
| android | 0.014933 |


In [8]:
segmented_results = test_results[['test','operative_system','price','converted']].groupby(['test','operative_system']).mean()
segmented_results['revenue'] = segmented_results['converted']*segmented_results['price']
segmented_results = segmented_results.sort_values('revenue', ascending = False)
segmented_results

| test | operative_system | price | converted | revenue |
|---|---|---|---|---|
| 1 | mac | 59.0 | 0.021218 | 1.251864 |
| 1 | iOS | 59.0 | 0.019994 | 1.179649 |
| 0 | mac | 39.0 | 0.02548 | 0.993706 |
| 0 | iOS | 39.0 | 0.023587 | 0.919909 |
| 1 | windows | 59.0 | 0.01399 | 0.825385 |
| 0 | windows | 39.0 | 0.018698 | 0.729207 |
| 1 | android | 59.0 | 0.012348 | 0.728531 |
| 0 | android | 39.0 | 0.016366 | 0.63827 |


### This time we look at the source of the user and find that friend referrals are an important way for the company to attract new customers. There are a few sources where the price increase does not pay off: seo-bing (-48%), seo-yahoo (-19%), and ads-yahoo (-1%). For seo-bing, revenue per user is 1.172 at \$39 versus 0.793 at \$59, so the original price earns about 48% more. Seo-bing has the second highest conversion rate of any source, while seo-yahoo is near the middle. For users coming from these two sources, we should keep the original price of \$39.

In [9]:
segmented_results = test_results[['source','converted']].groupby(['source']).mean()
segmented_results = segmented_results.sort_values('converted', ascending = False)
segmented_results

| source | converted |
|---|---|
| friend_referral | 0.0387 |
| seo-bing | 0.023732 |
| ads-google | 0.021511 |
| ads_facebook | 0.021224 |
| seo-google | 0.017012 |
| seo-yahoo | 0.016238 |
| seo_facebook | 0.016049 |
| seo-other | 0.015676 |
| ads-yahoo | 0.014789 |
| ads_other | 0.014372 |


In [10]:
# 'device' is a string column and cannot be averaged, so it is left out of the selection
segmented_results = test_results[['test','source','price','converted']].groupby(['test','source']).mean()
segmented_results['revenue'] = segmented_results['converted']*segmented_results['price']
segmented_results = segmented_results.sort_values('revenue', ascending = False)
segmented_results

| test | source | price | converted | revenue |
|---|---|---|---|---|
| 1 | friend_referral | 59.0 | 0.033369 | 1.96875 |
| 0 | friend_referral | 39.0 | 0.041768 | 1.628963 |
| 0 | seo-bing | 39.0 | 0.030053 | 1.172051 |
| 1 | ads-google | 59.0 | 0.019641 | 1.158801 |
| 1 | ads_facebook | 59.0 | 0.016865 | 0.995018 |
| 1 | seo-google | 59.0 | 0.015989 | 0.943356 |
| 0 | ads_facebook | 39.0 | 0.023689 | 0.923883 |
| 0 | ads-google | 39.0 | 0.022573 | 0.880343 |
| 1 | seo_facebook | 59.0 | 0.013589 | 0.801734 |
| 1 | seo-bing | 59.0 | 0.013447 | 0.793399 |
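The size of the seo-bing gap depends on which price is used as the baseline, so it helps to spell out both calculations. The sketch below uses only the two seo-bing revenue figures from the table above; the variable names are illustrative.

```python
# Revenue per user for seo-bing, taken from the table above.
revenue_39 = 1.172051
revenue_59 = 0.793399

# Change caused by raising the price, relative to the $39 baseline.
change_from_39 = (revenue_59 - revenue_39) / revenue_39 * 100

# How much more the $39 price earns, relative to the $59 result.
advantage_of_39 = (revenue_39 - revenue_59) / revenue_59 * 100

print(f"raising the price changes revenue by {change_from_39:.1f}%")
print(f"the $39 price earns {advantage_of_39:.1f}% more than $59")
```

Measured against the \$39 baseline the drop is about 32%; measured against the \$59 result the \$39 price earns about 48% more. Both describe the same pair of numbers.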


### Finally, looking at the device (mobile vs web), we see little difference in conversion between the two options, and the price increase makes sense for both.

In [11]:
segmented_results = test_results[['device','converted']].groupby(['device']).mean()
segmented_results = segmented_results.sort_values('converted', ascending = False)
segmented_results

| device | converted |
|---|---|
| mobile | 0.018555 |
| web | 0.018016 |


In [12]:
# 'source' is a string column and cannot be averaged, so it is left out of the selection
segmented_results = test_results[['test','price','converted','device']].groupby(['test','device']).mean()
segmented_results['revenue'] = segmented_results['converted']*segmented_results['price']
segmented_results = segmented_results.sort_values('revenue', ascending = False)
segmented_results

| test | device | price | converted | revenue |
|---|---|---|---|---|
| 1 | mobile | 59.0 | 0.016139 | 0.95219 |
| 1 | web | 59.0 | 0.014725 | 0.868748 |
| 0 | web | 39.0 | 0.01994 | 0.777667 |
| 0 | mobile | 39.0 | 0.019879 | 0.775267 |


### Now I will look at sample size to see if the test was run for the appropriate length. 

In [13]:
segmented_results_count = test_results.groupby(['test','source']).size()
segmented_results_count.sort_values()

test  source         
1     seo-bing             818
0     seo-bing            1331
1     seo-yahoo           2485
      ads-yahoo           2761
      seo-other           3365
0     seo-yahoo           4351
      ads-yahoo           4812
      seo-other           5885
1     friend_referral     7552
      seo_facebook        7727
      seo-google          8068
      ads-bing            8256
      ads_other          10583
0     friend_referral    13120
      seo_facebook       13458
      ads-bing           14595
      seo-google         15092
      ads_other          19266
1     ads_facebook       19271
      ads-google         21486
      direct_traffic     21580
0     ads_facebook       34066
      ads-google         37833
      direct_traffic     38727
dtype: int64
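The seo-bing groups are by far the smallest (818 and 1,331 users), which matters because the recommendation to keep \$39 for that source rests on them. As a rough illustration, the sketch below puts an approximate 95% interval around revenue per user for seo-bing at each price. It uses the conversion rates from the per-source revenue table and the counts above; the helper `revenue_interval` is hypothetical, and the normal approximation is crude when a group has only a handful of conversions.

```python
from math import sqrt

def revenue_interval(rate, n, price, z=1.96):
    """Approximate 95% interval for revenue per user (normal approximation)."""
    se = sqrt(rate * (1 - rate) / n)
    return price * (rate - z * se), price * (rate + z * se)

# seo-bing: conversion rates from the per-source table, counts from the output above.
low_39, high_39 = revenue_interval(rate=0.030053, n=1331, price=39.0)
low_59, high_59 = revenue_interval(rate=0.013447, n=818, price=59.0)

print(f"$39: {low_39:.2f} to {high_39:.2f}")
print(f"$59: {low_59:.2f} to {high_59:.2f}")
print("intervals overlap:", low_39 <= high_59 and low_59 <= high_39)
```

With these inputs the two intervals overlap, so the seo-bing result should be read as suggestive rather than conclusive until more data is available for that source.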

In [14]:
segmented_results_count = test_results.groupby(['test','operative_system']).size()
segmented_results_count.sort_values()

test  operative_system
1     mac                  8719
0     mac                 16366
1     android             26725
      iOS                 33610
      windows             37385
0     android             48210
      iOS                 61855
      windows             63591
dtype: int64

In [15]:
segmented_results_count = test_results.groupby(['test','device']).size()
segmented_results_count.sort_values()

test  device
1     web        48083
      mobile     65990
0     web        82246
      mobile    120481
dtype: int64

### I ran out of time to analyze the sample size needed for this test. 
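One possible starting point for that analysis is the standard sample size formula for comparing two proportions. The sketch below is an assumption-laden illustration, not a completed analysis: it assumes a two-sided z-test, a 5% significance level, 80% power, and equal group sizes (the experiment actually used an uneven split), and the function name `sample_size_per_group` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Users needed in each group to detect p_a vs p_b with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)

# Overall conversion rates observed at $39 and $59.
needed = sample_size_per_group(0.019904, 0.015543)
print(needed)
```

For the overall rates this comes out at roughly 14,000 users per group, which both test groups exceed comfortably. The same function can be applied to individual segments, where the smaller groups may fall short.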

### I also ignored user location for the time being, but it might be interesting to explore the conversion rate across the different locations and investigate any patterns that emerge. Conversion rates may be higher in cities with higher incomes, or in cities where an industry that uses this type of software is common.