# Lambda School Data Science - Making Data-backed Assertions

This is, for many, the main point of data science - to create and support reasoned arguments based on evidence. It's not a topic to master in a day, but it is worth spending some focused time thinking about it and structuring your approach.

## Lecture - generating a confounding variable

The prewatch material told a story about a hypothetical health condition where both the drug usage and overall health outcome were related to gender - thus making gender a confounding variable, obfuscating the possible relationship between the drug and the outcome.

Let's use Python to generate data that actually behaves in this fashion!

In [4]:
import random
dir(random)  # Reminding ourselves what we can do here

['BPF',
 'LOG4',
 'NV_MAGICCONST',
 'RECIP_BPF',
 'Random',
 'SG_MAGICCONST',
 'SystemRandom',
 'TWOPI',
 '_BuiltinMethodType',
 '_MethodType',
 '_Sequence',
 '_Set',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acos',
 '_bisect',
 '_ceil',
 '_cos',
 '_e',
 '_exp',
 '_inst',
 '_itertools',
 '_log',
 '_os',
 '_pi',
 '_random',
 '_sha512',
 '_sin',
 '_sqrt',
 '_test',
 '_test_generator',
 '_urandom',
 '_warn',
 'betavariate',
 'choice',
 'choices',
 'expovariate',
 'gammavariate',
 'gauss',
 'getrandbits',
 'getstate',
 'lognormvariate',
 'normalvariate',
 'paretovariate',
 'randint',
 'random',
 'randrange',
 'sample',
 'seed',
 'setstate',
 'shuffle',
 'triangular',
 'uniform',
 'vonmisesvariate',
 'weibullvariate']

In [5]:
# Let's think of another scenario:
# We work for a company that sells accessories for mobile phones.
# They have an ecommerce site, and we are supposed to analyze logs
# to determine what sort of usage is related to purchases, and thus guide
# website development to encourage higher conversion.

# The hypothesis - users who spend longer on the site tend
# to spend more. Seems reasonable, no?

# But there's a confounding variable! If they're on a phone, they:
# a) Spend less time on the site, but
# b) Are more likely to be interested in the actual products!

# Let's use namedtuple to represent our data


from collections import namedtuple
# purchased and mobile are bools, time_on_site in seconds
User = namedtuple('User', ['purchased', 'time_on_site', 'mobile'])

example_user = User(False, 12, False)
print(example_user)

User(purchased=False, time_on_site=12, mobile=False)


In [6]:
# And now let's generate 1000 example users
# 750 mobile, 250 not (i.e. desktop)
# A desktop user has a base conversion likelihood of 10%
# And it goes up by 1 percentage point for each 15 seconds they spend on the site
# And they spend anywhere from 10 seconds to 10 minutes on the site (uniform)
# Mobile users spend on average half as much time on the site as desktop
# But have three times the base likelihood of buying something

users = []

for _ in range(250):
  # Desktop users
  time_on_site = random.uniform(10, 600)
  purchased = random.random() < 0.1 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, False))
  
for _ in range(750):
  # Mobile users
  time_on_site = random.uniform(5, 300)
  purchased = random.random() < 0.3 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, True))
  
random.shuffle(users)
print(users[:10])

[User(purchased=False, time_on_site=138.50275651693582, mobile=False), User(purchased=False, time_on_site=267.5929793627306, mobile=True), User(purchased=False, time_on_site=199.2781480394963, mobile=True), User(purchased=False, time_on_site=32.29228892141426, mobile=True), User(purchased=False, time_on_site=31.84506639783792, mobile=True), User(purchased=True, time_on_site=431.88972126473874, mobile=False), User(purchased=False, time_on_site=102.56392946463784, mobile=True), User(purchased=True, time_on_site=247.6425339558971, mobile=True), User(purchased=True, time_on_site=32.08563569511377, mobile=True), User(purchased=False, time_on_site=292.8609400348299, mobile=True)]
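As a sanity check on the generating rules above, the expected purchase rate for each group can be worked out directly. This is a minimal sketch: the helper name `purchase_probability` is hypothetical (it is not part of the notebook), and it relies on two facts from the cell above, that time on site is uniform and that the purchase probability is linear in time, so the expected rate equals the probability at the mean time.

```python
def purchase_probability(time_on_site, mobile):
    """Purchase probability under the generating rules used above.

    Hypothetical helper for illustration: base rate of 0.3 (mobile) or
    0.1 (desktop), plus 1 percentage point per 15 seconds on the site.
    """
    base = 0.3 if mobile else 0.1
    return base + time_on_site / 1500


# The mean of a uniform distribution is the midpoint of its range.
desktop_mean_time = (10 + 600) / 2
mobile_mean_time = (5 + 300) / 2

# The probability is linear in time, so evaluating it at the mean time
# gives the expected purchase rate for the whole group.
desktop_rate = purchase_probability(desktop_mean_time, mobile=False)
mobile_rate = purchase_probability(mobile_mean_time, mobile=True)

print(f"desktop: {desktop_rate:.3f}, mobile: {mobile_rate:.3f}")
# desktop: 0.303, mobile: 0.402
```

Rates computed from any one random sample will wander around these values, since each purchase is a random draw.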


In [7]:
# Let's put this in a dataframe so we can look at it more easily
import pandas as pd
user_data = pd.DataFrame(users)
user_data.head()


   purchased  time_on_site  mobile
0      False    138.502757   False
1      False    267.592979    True
2      False    199.278148    True
3      False     32.292289    True
4      False     31.845066    True

In [8]:
# Let's use crosstabulation to try to see what's going on
pd.crosstab(user_data['purchased'], user_data['time_on_site'])


time_on_site  5.527746    6.315849    6.788962    6.856021    6.945923    \
purchased                                                                  
False                  0           0           1           1           1   
True                   1           1           0           0           0   

time_on_site  6.965220    7.026436    7.344258    7.639928    7.869767    ...  \
purchased                                                                 ...   
False                  1           1           1           0           1  ...   
True                   0           0           0           1           0  ...   

time_on_site  544.587275  547.240393  555.338996  555.859888  576.777186  \
purchased                                                                  
False                  1           1           1           1           0   
True                   0           0           0           0           1   

time_on_site  576.856173  580.990263  587.543913  590.138608  595

In [9]:
# OK, that's not quite what we want
# Time is continuous! We need to put it in discrete buckets
# Pandas calls these bins, and pandas.cut helps make them
import pandas as pd
time_bins = pd.cut(user_data['time_on_site'], 5)  # 5 equal-width bins
pd.crosstab(user_data['purchased'], time_bins)

# Optional: a histogram shows how time_on_site is distributed
# (note the keyword argument is `bins`, not `bin`)
# user_data['time_on_site'].hist(bins=15)
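
To see what `pd.cut` returns before applying it to the full dataset, here is a small sketch on made-up values. The `times` series is hypothetical and not drawn from the generated users; it only illustrates that passing an integer asks for that many equal-width intervals spanning the data.

```python
import pandas as pd

# Hypothetical times on site, in seconds (illustration only)
times = pd.Series([12, 45, 130, 290, 310, 455, 590], name='time_on_site')

# An integer second argument requests that many equal-width bins
binned = pd.cut(times, 3)

# Each value is replaced by the interval it falls into;
# value_counts shows how many values landed in each bin
print(binned.value_counts(sort=False))
```

Because the bins are equal in width rather than in count, a skewed variable can leave some bins nearly empty, which is worth keeping in mind when reading the normalized crosstab.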


In [6]:
# We can make this a bit clearer by normalizing (getting %)
pd.crosstab(user_data['purchased'], time_bins, normalize='columns')


In [0]:
# That seems counter to our hypothesis
# More time on the site can actually go along with fewer purchases

# But we know why, since we generated the data!
# Let's look at mobile and purchased
pd.crosstab(user_data['purchased'], user_data['mobile'], normalize='columns')

mobile     False      True
purchased
False      0.672  0.622667
True       0.328  0.377333


In [0]:
# Yep, mobile users are more likely to buy things
# But we're still not seeing the *whole* story until we look at all 3 at once

# Live/stretch goal - how can we do that?
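
One possible way to approach the stretch goal, offered as a sketch rather than the intended solution: group by device and time bin together and take the mean of `purchased`, which gives the purchase rate within each combination. The block regenerates the data so that it runs on its own. The fixed seed, the `time_bin` column name, and the choice of `groupby` over a multi-index crosstab are all assumptions made for this illustration.

```python
import random
from collections import namedtuple

import pandas as pd

random.seed(42)  # assumption: fixed seed so the sketch is repeatable

User = namedtuple('User', ['purchased', 'time_on_site', 'mobile'])

# Same generating rules as the lecture cell above
users = []
for _ in range(250):  # desktop users
    t = random.uniform(10, 600)
    users.append(User(random.random() < 0.1 + t / 1500, t, False))
for _ in range(750):  # mobile users
    t = random.uniform(5, 300)
    users.append(User(random.random() < 0.3 + t / 1500, t, True))

user_data = pd.DataFrame(users)

# Hypothetical helper column: 5 equal-width time bins
user_data['time_bin'] = pd.cut(user_data['time_on_site'], 5)

# The mean of a boolean column is the fraction of True values, so this is
# the purchase rate for each (device, time bin) combination.
# observed=True skips combinations with no users (e.g. mobile over 300 s).
rates = user_data.groupby(['mobile', 'time_bin'], observed=True)['purchased'].mean()
print(rates)
```

Reading the result one device at a time holds the confounder fixed, so the relationship between time on site and purchasing can be compared within mobile users and within desktop users separately.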