# Lambda School Data Science - Making Data-backed Assertions

This is, for many, the main point of data science - to create and support reasoned arguments based on evidence. It's not a topic to master in a day, but it is worth some focused time thinking about and structuring your approach to it.

## Lecture - generating a confounding variable

The prewatch material told a story about a hypothetical health condition where both the drug usage and overall health outcome were related to gender - thus making gender a confounding variable, obfuscating the possible relationship between the drug and the outcome.

Let's use Python to generate data that actually behaves in this fashion!

In [2]:
import random
dir(random)  # Reminding ourselves what we can do here





['BPF',
 'LOG4',
 'NV_MAGICCONST',
 'RECIP_BPF',
 'Random',
 'SG_MAGICCONST',
 'SystemRandom',
 'TWOPI',
 '_BuiltinMethodType',
 '_MethodType',
 '_Sequence',
 '_Set',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acos',
 '_bisect',
 '_ceil',
 '_cos',
 '_e',
 '_exp',
 '_inst',
 '_itertools',
 '_log',
 '_pi',
 '_random',
 '_sha512',
 '_sin',
 '_sqrt',
 '_test',
 '_test_generator',
 '_urandom',
 '_warn',
 'betavariate',
 'choice',
 'choices',
 'expovariate',
 'gammavariate',
 'gauss',
 'getrandbits',
 'getstate',
 'lognormvariate',
 'normalvariate',
 'paretovariate',
 'randint',
 'random',
 'randrange',
 'sample',
 'seed',
 'setstate',
 'shuffle',
 'triangular',
 'uniform',
 'vonmisesvariate',
 'weibullvariate']

In [3]:
# Let's think of another scenario:
# We work for a company that sells accessories for mobile phones.
# They have an ecommerce site, and we are supposed to analyze logs
# to determine what sort of usage is related to purchases, and thus guide
# website development to encourage higher conversion.

# The hypothesis - users who spend longer on the site tend
# to spend more. Seems reasonable, no?

# But there's a confounding variable! If they're on a phone, they:
# a) Spend less time on the site, but
# b) Are more likely to be interested in the actual products!

# Let's use namedtuple to represent our data

from collections import namedtuple
# purchased and mobile are bools, time_on_site in seconds
User = namedtuple('User', ['purchased','time_on_site', 'mobile'])

example_user = User(False, 12, False)
print(example_user)

User(purchased=False, time_on_site=12, mobile=False)
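As a brief aside on why `namedtuple` is convenient here, the sketch below shows a few ways to read and copy a `User` record. The `longer_visit` name is made up for illustration and is not part of the lecture code.

```python
from collections import namedtuple

# Same record type as in the lecture cell above
User = namedtuple('User', ['purchased', 'time_on_site', 'mobile'])
example_user = User(purchased=False, time_on_site=12, mobile=False)

print(example_user.time_on_site)  # access by field name
print(example_user[1])            # or by position, like a regular tuple
print(example_user._asdict())     # mapping of field name to value

# namedtuples are immutable, so _replace returns a modified copy
longer_visit = example_user._replace(time_on_site=90)
print(longer_visit)
```

Because each record carries its field names, a list of these tuples can later be handed to `pd.DataFrame` and the columns are named automatically.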


In [4]:
# And now let's generate 1000 example users
# 750 mobile, 250 not (i.e. desktop)
# A desktop user has a base conversion likelihood of 10%
# And it goes up by 1% for each 15 seconds they spend on the site
# And they spend anywhere from 10 seconds to 10 minutes on the site (uniform)
# Mobile users spend on average half as much time on the site as desktop
# But have three times as much base likelihood of buying something

users = []

for _ in range(250):
  # Desktop users
  time_on_site = random.uniform(10, 600)
  purchased = random.random() < 0.1 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, False))
  
for _ in range(750):
  # Mobile users
  time_on_site = random.uniform(5, 300)
  purchased = random.random() < 0.3 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, True))
  
random.shuffle(users)
print(users[:10])

[User(purchased=False, time_on_site=506.00184956648866, mobile=False), User(purchased=True, time_on_site=29.885309674640023, mobile=True), User(purchased=False, time_on_site=161.15268827830772, mobile=True), User(purchased=False, time_on_site=195.8850614308432, mobile=False), User(purchased=False, time_on_site=256.70310750804344, mobile=True), User(purchased=False, time_on_site=469.27207097802847, mobile=False), User(purchased=True, time_on_site=118.55327583145818, mobile=True), User(purchased=False, time_on_site=285.0360803613483, mobile=True), User(purchased=False, time_on_site=270.8688424654102, mobile=False), User(purchased=True, time_on_site=263.0139059016244, mobile=True)]
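The generation rules above imply an expected purchase rate for each group. Since the purchase probability is linear in `time_on_site` and the time is uniform, the expected rate is the base rate plus the mean time divided by 1500. The sketch below works through that arithmetic; the helper name `expected_rate` is made up for illustration.

```python
def expected_rate(base, low, high, seconds_per_unit=1500):
    """Expected purchase probability when time_on_site ~ Uniform(low, high)."""
    mean_time = (low + high) / 2  # mean of a uniform distribution
    return base + mean_time / seconds_per_unit

desktop_rate = expected_rate(0.1, 10, 600)  # 0.1 + 305 / 1500
mobile_rate = expected_rate(0.3, 5, 300)    # 0.3 + 152.5 / 1500

print(f"desktop: {desktop_rate:.3f}")
print(f"mobile:  {mobile_rate:.3f}")
```

These come out to roughly 0.303 for desktop and 0.402 for mobile, which is in line with the observed rates (0.324 and 0.393) in the mobile crosstab later in the lecture.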


In [5]:
# Let's put this in a dataframe so we can look at it more easily
import pandas as pd
user_data = pd.DataFrame(users)
user_data.head()

| | purchased | time_on_site | mobile |
|---|---|---|---|
| 0 | False | 506.00185 | False |
| 1 | True | 29.88531 | True |
| 2 | False | 161.152688 | True |
| 3 | False | 195.885061 | False |
| 4 | False | 256.703108 | True |


In [7]:
# Let's use crosstabulation to try to see what's going on
pd.crosstab(user_data['purchased'], user_data['time_on_site'])

time_on_site,5.707044316034292,5.730166273563265,7.549350488808691,8.195207654936764,8.550792540984746,8.724316832103028,8.727100626957398,8.987104366553087,9.403203905070326,10.18520712092565,11.098159673037433,11.10092259036854,11.23725501502453,11.553535036595587,11.91565630555065,12.004728370860219,12.703469892578308,12.708599830743653,13.085597324085105,13.164045060956958,13.291995233977584,14.016858499786798,14.350458915869845,14.753994466665386,15.683246779037379,15.790603189806793,17.220587776842812,17.36923115024014,17.477006761250834,17.681001400574676,18.332808109084247,18.96712992638418,19.149263417923844,19.95767007850413,20.34941011290097,20.77402107513099,20.797126438388887,20.816657425956404,20.864479184532144,21.42153394816957,...,492.51182318164905,493.18475638871513,494.568908514612,495.14580009703064,498.4861947627664,500.97817977253857,502.15496610962236,502.863081759148,503.6206801018247,506.00184956648866,514.1970988195785,515.3885284205296,518.5706484222865,518.7503532302561,522.6132629197372,524.8422875253555,536.7433647903537,536.8859015659693,537.0752995817188,538.31520394842,539.6261949463808,540.9689210164545,542.4688285737924,543.5149171100618,546.6655115575552,547.3092428251533,551.657902303687,558.7598542169743,567.922041964437,568.4011545668243,568.8782237152349,575.0509618682668,576.4330722161458,578.8945514540654,579.0952730686831,579.3813386788499,581.5436172912584,582.4398347517956,592.3320792046729,595.6509937151247
purchased
False,0,1,1,1,0,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,0,1,0,1,1,1,1,0,1,1,0,...,0,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,0,1,0,1,1,1,1,0,0,1,1,0,1,1,1,0,1,0,1,0,0,0,1
True,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,...,1,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,1,0,1,1,1,0


In [83]:
# !pip freeze showed pandas 0.24.2; install the older 0.23.4 release instead
!pip install pandas==0.23.4

Collecting pandas==0.23.4
  Downloading https://files.pythonhosted.org/packages/e1/d8/feeb346d41f181e83fba45224ab14a8d8af019b48af742e047f3845d8cff/pandas-0.23.4-cp36-cp36m-manylinux1_x86_64.whl (8.9MB)
ERROR: google-colab 1.0.0 has requirement pandas~=0.24.0, but you'll have pandas 0.23.4 which is incompatible.
Installing collected packages: pandas
  Found existing installation: pandas 0.24.2
    Uninstalling pandas-0.24.2:
      Successfully uninstalled pandas-0.24.2
Successfully installed pandas-0.23.4


In [8]:
# OK, that's not quite what we want
# Time is continuous! We need to put it in discrete buckets
# Pandas calls these bins, and pandas.cut helps make them

time_bins = pd.cut(user_data['time_on_site'], 5)  # 5 equal-width bins
pd.crosstab(user_data['purchased'], time_bins)

| time_on_site | (5.117, 123.696] | (123.696, 241.685] | (241.685, 359.673] | (359.673, 477.662] | (477.662, 595.651] |
|---|---|---|---|---|---|
| purchased | | | | | |
| False | 241 | 203 | 119 | 34 | 27 |
| True | 117 | 128 | 85 | 26 | 20 |
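Passing an integer to `pd.cut` splits the observed range into intervals of equal width, so the bins can hold very different numbers of rows. As a sketch of the alternatives, the example below compares equal-width bins, hand-picked bin edges with labels, and `pd.qcut`, which aims for equal counts. The sample values, edges, and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical visit lengths in seconds (not taken from the lecture data)
times = pd.Series([5, 30, 60, 120, 240, 300, 450, 600], name='time_on_site')

# Equal-width bins: the range is split into 3 intervals of the same width
equal_width = pd.cut(times, 3)

# Explicit edges with readable labels (intervals are closed on the right)
explicit = pd.cut(times, bins=[0, 60, 300, 600],
                  labels=['<=1 min', '1-5 min', '5-10 min'])

# Quantile bins: each bin gets roughly the same number of rows
equal_count = pd.qcut(times, 4)

print(equal_width.value_counts().sort_index())
print(explicit.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Explicit edges tend to be easier to explain to a reader, while quantile bins avoid nearly empty buckets when the data is skewed.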


In [9]:
# We can make this a bit clearer by normalizing (getting %)
pd.crosstab(user_data['purchased'], time_bins, normalize='columns')

| time_on_site | (5.117, 123.696] | (123.696, 241.685] | (241.685, 359.673] | (359.673, 477.662] | (477.662, 595.651] |
|---|---|---|---|---|---|
| purchased | | | | | |
| False | 0.673184 | 0.613293 | 0.583333 | 0.566667 | 0.574468 |
| True | 0.326816 | 0.386707 | 0.416667 | 0.433333 | 0.425532 |
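To make `normalize='columns'` concrete, the sketch below redoes the division by hand, using the counts from the un-normalized crosstab in the previous cell. Each cell is divided by its column total, so every column sums to 1.

```python
# Counts copied from the raw crosstab above, one entry per time bin
not_purchased = [241, 203, 119, 34, 27]
purchased = [117, 128, 85, 26, 20]

# normalize='columns' divides each cell by its column total
purchase_share = [yes / (no + yes)
                  for no, yes in zip(not_purchased, purchased)]

for share in purchase_share:
    print(f"{share:.6f}")
```

The first value is 117 / (241 + 117), which rounds to the 0.326816 shown in the `True` row of the normalized table.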


In [10]:
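# In this run the purchase rate mostly rises with time on site (about 33% to 43%),
# dipping slightly in the longest bin. Because the data is random, another run
# can even show fewer purchases with more time on the site.

# Either way, time on site is not the whole story, and we know why,
# since we generated the data!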
# Let's look at mobile and purchased
pd.crosstab(user_data['purchased'], user_data['mobile'], normalize='columns')

| mobile | False | True |
|---|---|---|
| purchased | | |
| False | 0.676 | 0.606667 |
| True | 0.324 | 0.393333 |


In [0]:
# Yep, mobile users are more likely to buy things
# But we're still not seeing the *whole* story until we look at all 3 at once

# Live/stretch goal - how can we do that?
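One possible approach to the stretch goal, as a sketch rather than the official answer: `pd.crosstab` accepts a list of columns, so `mobile` and the time bins can be nested in the column index. The data is regenerated here with a fixed seed (an assumption, made only for repeatability), so the numbers will not match the outputs above. The bin edges 0, 150, 300, 600 are an arbitrary choice.

```python
import random
import pandas as pd

random.seed(42)  # assumption: fixed seed so the sketch is repeatable

rows = []
for _ in range(250):  # desktop users, same rules as the lecture cell
    t = random.uniform(10, 600)
    rows.append({'purchased': random.random() < 0.1 + t / 1500,
                 'time_on_site': t, 'mobile': False})
for _ in range(750):  # mobile users
    t = random.uniform(5, 300)
    rows.append({'purchased': random.random() < 0.3 + t / 1500,
                 'time_on_site': t, 'mobile': True})
user_data = pd.DataFrame(rows)

# Hand-picked bin edges (arbitrary) so desktop and mobile share the same bins
time_bins = pd.cut(user_data['time_on_site'], bins=[0, 150, 300, 600])

# Nest time bins inside mobile; each (mobile, bin) column is normalized
three_way = pd.crosstab(user_data['purchased'],
                        [user_data['mobile'], time_bins],
                        normalize='columns')
print(three_way)
```

Reading across the `True` row within one device type compares purchase rates by time on site while holding `mobile` fixed, which is the comparison the two-way tables cannot show.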

## Assignment - what's going on here?

Consider the data in `persons.csv` (already prepared for you, in the repo for the week). It has four columns - a unique id, followed by age (in years), weight (in lbs), and exercise time (in minutes/week) of 1200 (hypothetical) people.

Try to figure out which variables are possibly related to each other, and which may be confounding relationships.

In [0]:
# TODO - your code here
# Use what we did live in lecture as an example

# HINT - you can find the raw URL on GitHub and potentially use that
# to load the data with read_csv, or you can upload it yourself

import pandas as pd

data = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-1-Sprint-1-Dealing-With-Data/master/module3-databackedassertions/persons.csv'
df = pd.read_csv(data)

In [13]:
df.head()


| | Unnamed: 0 | age | weight | exercise_time |
|---|---|---|---|---|
| 0 | 0 | 44 | 118 | 192 |
| 1 | 1 | 41 | 161 | 35 |
| 2 | 2 | 46 | 128 | 220 |
| 3 | 3 | 39 | 216 | 57 |
| 4 | 4 | 28 | 116 | 182 |


In [14]:
df.columns

Index(['Unnamed: 0', 'age', 'weight', 'exercise_time'], dtype='object')

In [15]:
df.rename(columns={'Unnamed: 0': 'Persons'}, inplace=True)

df.columns
# df.isnull().sum()  # would count missing values per column

Index(['Persons', 'age', 'weight', 'exercise_time'], dtype='object')

In [0]:
#pd.crosstab(df['exercise_time'], df['weight'])

In [18]:
time_bins = pd.cut(df['exercise_time'], 5)  # 5 equal-width bins
pd.crosstab(df['weight'], time_bins, normalize='columns')

| exercise_time | (-0.3, 60.0] | (60.0, 120.0] | (120.0, 180.0] | (180.0, 240.0] | (240.0, 300.0] |
|---|---|---|---|---|---|
| weight | | | | | |
| 100 | 0.017986 | 0.003165 | 0.004484 | 0.020833 | 0.020942 |
| 101 | 0.000000 | 0.003165 | 0.008969 | 0.026042 | 0.010471 |
| 102 | 0.003597 | 0.006329 | 0.013453 | 0.015625 | 0.020942 |
| 103 | 0.003597 | 0.006329 | 0.008969 | 0.000000 | 0.015707 |
| 104 | 0.007194 | 0.006329 | 0.004484 | 0.010417 | 0.010471 |
| 105 | 0.010791 | 0.000000 | 0.008969 | 0.005208 | 0.020942 |
| 106 | 0.003597 | 0.006329 | 0.008969 | 0.005208 | 0.010471 |
| 107 | 0.000000 | 0.006329 | 0.017937 | 0.015625 | 0.010471 |
| 108 | 0.003597 | 0.006329 | 0.026906 | 0.020833 | 0.031414 |
| 109 | 0.007194 | 0.012658 | 0.013453 | 0.020833 | 0.000000 |
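With one row per distinct weight, the table above is hard to read for the same reason the un-binned `time_on_site` crosstab was. A possible next step, sketched below, is to bin `weight` as well. To keep the example runnable without a download, it uses a small stand-in frame: the first five rows are copied from the `df.head()` output above and the last three are made up. The choice of 3 bins is arbitrary.

```python
import pandas as pd

# Stand-in for persons.csv: first five rows from df.head() above,
# last three rows are hypothetical values added for illustration
people = pd.DataFrame({
    'age': [44, 41, 46, 39, 28, 60, 72, 35],
    'weight': [118, 161, 128, 216, 116, 190, 205, 140],
    'exercise_time': [192, 35, 220, 57, 182, 40, 10, 120],
})

# Bin both variables so each axis has a handful of categories
weight_bins = pd.cut(people['weight'], 3)
exercise_bins = pd.cut(people['exercise_time'], 3)

binned = pd.crosstab(weight_bins, exercise_bins, normalize='columns')
print(binned)
```

On the real data, replacing `people` with `df` and using more bins should give a compact table in which any trend between weight and exercise time is easier to spot.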


### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  What are the variable types in the data?
2.  What are the relationships between the variables?
3.  Which relationships are "real", and which spurious?
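As a starting point for the first two questions, the sketch below inspects column types and pairwise correlations. It runs on a small stand-in frame (the first five rows are from the `df.head()` output above, the last three are made up), so the printed numbers say nothing about the real `persons.csv`.

```python
import pandas as pd

# Stand-in for persons.csv: first five rows from df.head() above,
# last three rows are hypothetical values added for illustration
people = pd.DataFrame({
    'age': [44, 41, 46, 39, 28, 60, 72, 35],
    'weight': [118, 161, 128, 216, 116, 190, 205, 140],
    'exercise_time': [192, 35, 220, 57, 182, 40, 10, 120],
})

# Question 1: the storage type of each column
print(people.dtypes)

# Question 2: pairwise Pearson correlations between the numeric columns
corr = people.corr()
print(corr.round(2))
```

A correlation matrix only describes pairwise linear association. Deciding which relationships are "real" and which are spurious still requires comparing groups while holding the third variable fixed, as in the lecture.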


## Stretch goals and resources

The following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub.

- [Spurious Correlations](http://tylervigen.com/spurious-correlations)
- [NIH on controlling for confounding variables](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017459/)

Stretch goals:

- Produce your own plot inspired by the Spurious Correlation visualizations (and consider writing a blog post about it - both the content and how you made it)
- Pick one of the techniques that NIH highlights for confounding variables - we'll be going into many of them later, but see if you can find which Python modules may help (hint - check scikit-learn)