<a href="https://colab.research.google.com/github/KryssyCo/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/Krista_Shepard_DS5_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School Data Science - Making Data-backed Assertions

This is, for many, the main point of data science - to create and support reasoned arguments based on evidence. It's not a topic to master in a day, but it is worth some focused time thinking about and structuring your approach to it.

## Lecture - generating a confounding variable

The prewatch material told a story about a hypothetical health condition where both the drug usage and overall health outcome were related to gender - thus making gender a confounding variable, obfuscating the possible relationship between the drug and the outcome.

Let's use Python to generate data that actually behaves in this fashion!

In [0]:
import random
dir(random)  # Reminding ourselves what we can do here

['BPF',
 'LOG4',
 'NV_MAGICCONST',
 'RECIP_BPF',
 'Random',
 'SG_MAGICCONST',
 'SystemRandom',
 'TWOPI',
 '_BuiltinMethodType',
 '_MethodType',
 '_Sequence',
 '_Set',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acos',
 '_bisect',
 '_ceil',
 '_cos',
 '_e',
 '_exp',
 '_inst',
 '_itertools',
 '_log',
 '_pi',
 '_random',
 '_sha512',
 '_sin',
 '_sqrt',
 '_test',
 '_test_generator',
 '_urandom',
 '_warn',
 'betavariate',
 'choice',
 'choices',
 'expovariate',
 'gammavariate',
 'gauss',
 'getrandbits',
 'getstate',
 'lognormvariate',
 'normalvariate',
 'paretovariate',
 'randint',
 'random',
 'randrange',
 'sample',
 'seed',
 'setstate',
 'shuffle',
 'triangular',
 'uniform',
 'vonmisesvariate',
 'weibullvariate']

In [0]:
# Let's think of another scenario:
# We work for a company that sells accessories for mobile phones.
# They have an ecommerce site, and we are supposed to analyze logs
# to determine what sort of usage is related to purchases, and thus guide
# website development to encourage higher conversion.

# The hypothesis - users who spend longer on the site tend
# to spend more. Seems reasonable, no?

# But there's a confounding variable! If they're on a phone, they:
# a) Spend less time on the site, but
# b) Are more likely to be interested in the actual products!

# Let's use namedtuple to represent our data

from collections import namedtuple
# purchased and mobile are bools, time_on_site in seconds
User = namedtuple('User', ['purchased','time_on_site', 'mobile'])

example_user = User(False, 12, False)
print(example_user)

User(purchased=False, time_on_site=12, mobile=False)


In [0]:
# And now let's generate 1000 example users
# 750 mobile, 250 not (i.e. desktop)
# A desktop user has a base conversion likelihood of 10%
# And it goes up by 1% for each 15 seconds they spend on the site
# And they spend anywhere from 10 seconds to 10 minutes on the site (uniform)
# Mobile users spend on average half as much time on the site as desktop
# But have three times as much base likelihood of buying something

users = []

for _ in range(250):
  # Desktop users
  time_on_site = random.uniform(10, 600)
  purchased = random.random() < 0.1 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, False))
  
for _ in range(750):
  # Mobile users
  time_on_site = random.uniform(5, 300)
  purchased = random.random() < 0.3 + (time_on_site / 1500)
  users.append(User(purchased, time_on_site, True))
  
random.shuffle(users)
print(users[:10])

[User(purchased=False, time_on_site=6.744591919820673, mobile=True), User(purchased=False, time_on_site=260.5125797775788, mobile=True), User(purchased=True, time_on_site=182.63944678884343, mobile=True), User(purchased=True, time_on_site=128.56525807832793, mobile=True), User(purchased=True, time_on_site=188.0568711970568, mobile=True), User(purchased=True, time_on_site=275.0509124995463, mobile=True), User(purchased=True, time_on_site=155.77365008296277, mobile=True), User(purchased=False, time_on_site=119.52951789018366, mobile=True), User(purchased=False, time_on_site=87.24939500170488, mobile=True), User(purchased=True, time_on_site=80.97138640457563, mobile=True)]
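The generating rules above boil down to one line of arithmetic per user. As a quick check of that arithmetic, here is a minimal sketch; `purchase_probability` is a hypothetical helper written for illustration, not part of the lecture code.

```python
def purchase_probability(time_on_site, mobile):
    """Purchase probability implied by the generating rules above.

    Hypothetical helper for illustration; the lecture inlines this expression.
    """
    base = 0.3 if mobile else 0.1       # mobile base rate is three times desktop
    return base + time_on_site / 1500   # +1 percentage point per 15 seconds

# Shortest and longest possible visits for each device type
for seconds, mobile in [(10, False), (600, False), (5, True), (300, True)]:
    device = 'mobile' if mobile else 'desktop'
    print(f'{device:8s}{seconds:4d}s -> {purchase_probability(seconds, mobile):.3f}')
```

Both device types top out at a probability of 0.5 at their longest visit (600 seconds on desktop, 300 seconds on mobile), but mobile users start from a much higher floor.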


In [0]:
# Let's put this in a dataframe so we can look at it more easily
import pandas as pd
user_data = pd.DataFrame(users)
user_data.head()

|   | purchased | time_on_site | mobile |
|---|-----------|--------------|--------|
| 0 | False | 6.744592 | True |
| 1 | False | 260.51258 | True |
| 2 | True | 182.639447 | True |
| 3 | True | 128.565258 | True |
| 4 | True | 188.056871 | True |


In [0]:
# Let's use crosstabulation to try to see what's going on
pd.crosstab(user_data['purchased'], user_data['time_on_site'])

time_on_site,5.803221869962067,5.9457407221367955,6.03714652345346,6.159911401978521,6.744591919820673,6.923252656806344,7.043016078813782,7.182273129078421,7.679453635480238,8.136172200800527,8.183425932404369,8.244321141643224,8.67997105199326,8.745335944196846,9.374638377987973,10.668822384704715,10.818992279932786,10.876516694229892,10.956563251269323,11.200904503122327,11.752991067124814,12.161643817282897,12.831444599510117,13.303169241208762,13.329823373402226,13.610151986393568,14.380833163142363,15.227930341468474,15.25535888915947,15.264314702393028,15.761514037245586,15.876709897573788,16.0516501601046,16.136204777317225,17.93995282856358,18.399076827523842,18.44610676823452,19.011171848718448,19.265041198489477,19.66142600180024,...,520.7986258105194,521.4247023382302,528.7929475438956,534.3614607147614,534.8887280381758,535.4681326175245,538.0328658743189,539.2809153799716,543.2766786631039,543.4935035699468,544.3199524068077,546.8459900344104,548.1286488935908,554.7086229167335,556.5758884013085,558.2234549000223,561.069739480919,562.0747240952522,562.7177168853427,564.4856703381535,565.7372738952068,566.0103865296826,566.9182903575502,568.4992876133292,569.0273287792571,571.1335328589765,573.449095874549,573.6417250806675,575.6654012907818,576.0320255334888,576.5749946091761,576.960551273148,579.7356463731574,579.9094735964858,581.7418723953473,590.8043667010311,593.7932678650694,594.9124443664989,598.4028714195957,599.3783963747298
purchased
False,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,0,1,0,1,1,1,1,1,0,1,1,0,1,1,1,1,0,0,1,...,1,0,0,0,1,1,0,0,1,0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,0,1,0,1,1,1,0,1,1,1
True,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,...,0,1,1,1,0,0,1,1,0,1,0,1,0,1,0,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,0


In [0]:
# OK, that's not quite what we want
# Time is continuous! We need to put it in discrete buckets
# Pandas calls these bins, and pandas.cut helps make them

time_bins = pd.cut(user_data['time_on_site'], 5)  # 5 equal-width bins
pd.crosstab(user_data['purchased'], time_bins)

TypeError: ignored
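Passing an integer to `pd.cut` splits the observed range into bins of equal width, which is not the same as bins holding equal numbers of rows (that is what `pd.qcut` does). The sketch below contrasts the two on a small, made-up series; the values are hypothetical and chosen only to make the difference visible.

```python
import pandas as pd

# Hypothetical toy values (seconds on site), skewed toward short visits
times = pd.Series([5, 10, 15, 20, 30, 60, 120, 600])

# pd.cut: 4 bins of equal width across the range of the data
equal_width = pd.cut(times, 4)
# pd.qcut: 4 bins holding (roughly) equal numbers of observations
equal_count = pd.qcut(times, 4)

width_counts = equal_width.value_counts()
count_counts = equal_count.value_counts()
print(width_counts)
print(count_counts)
```

With skewed data, equal-width bins can leave most rows in a single bin, so it is worth checking the bin counts before reading too much into a crosstab.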

In [0]:
# We can make this a bit clearer by normalizing (getting %)
pd.crosstab(user_data['purchased'], time_bins, normalize='columns')

In [0]:
# That seems counter to our hypothesis
# More time on the site can actually go along with fewer purchases

# But we know why, since we generated the data!
# Let's look at mobile and purchased
pd.crosstab(user_data['purchased'], user_data['mobile'], normalize='columns')

In [0]:
# Yep, mobile users are more likely to buy things
# But we're still not seeing the *whole* story until we look at all 3 at once

# Live/stretch goal - how can we do that?
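One possible answer to the stretch goal, offered as a sketch rather than the lecture's solution: `pd.crosstab` accepts a list of series for its index, so device type and time bin can be stacked as row levels and compared against `purchased`. The data is regenerated here with a fixed seed (an assumption, so the block runs on its own), and `demo` and `three_way` are names introduced for illustration.

```python
import random
import pandas as pd

random.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Regenerate users with the same rules as the lecture cells:
# (mobile, count, min seconds, max seconds, base purchase probability)
groups = [(False, 250, 10, 600, 0.1), (True, 750, 5, 300, 0.3)]
rows = []
for mobile, count, low, high, base in groups:
    for _ in range(count):
        time_on_site = random.uniform(low, high)
        purchased = random.random() < base + time_on_site / 1500
        rows.append((purchased, time_on_site, mobile))
demo = pd.DataFrame(rows, columns=['purchased', 'time_on_site', 'mobile'])

time_bins = pd.cut(demo['time_on_site'], 5)  # 5 equal-width bins

# A list as the index stratifies by device first, then by time bin;
# normalize='index' turns each row into purchase rates for that stratum
three_way = pd.crosstab([demo['mobile'], time_bins], demo['purchased'],
                        normalize='index')
print(three_way)
```

Reading down the rows within one device type shows how purchase rate changes with time on site once device is held fixed, which is the comparison the two-variable tables could not make.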

## Assignment - what's going on here?

Consider the data in `persons.csv` (already prepared for you, in the repo for the week). It has four columns - a unique id, followed by age (in years), weight (in lbs), and exercise time (in minutes/week) of 1200 (hypothetical) people.

Try to figure out which variables are possibly related to each other, and which may be confounding relationships.

In [0]:
import pandas as pd
persons = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-1-Sprint-1-Dealing-With-Data/master/module3-databackedassertions/persons.csv')
persons.head()

**My hypothesis is that people who spend more time exercising will weigh less, so a person's weight depends on how much time they exercise. However, age is likely a confounding variable, because there has been medical research showing that it is harder to lose weight as you get older. I also think a significant piece of data is missing: the biological gender of each person.**

In [0]:
from collections import namedtuple
# age, exercise_time and weight are all ints: age in years, exercise time
# in minutes per week, and weight in pounds.
people = namedtuple('people', ['age', 'exercise_time', 'weight'])
example_people = people(24, 30, 118)
print(example_people)
# Just FYI: I follow each step from the lecture the first time, and get brave
# once the concepts are clear to me.

In [0]:
persons.describe()

In [0]:
persons.dtypes  # all columns are integers

  

In [0]:
persons.isnull().sum() # No missing data

In [0]:
pd.crosstab(persons['weight'], persons['exercise_time'])

# I used crosstabulation to try to see what is going on. With enough time you
# could probably draw some conclusion from this, but it would be very time
# consuming: the columns are exercise_time by the minute. I need bins!

In [39]:
!pip install pandas==0.23.4

Collecting pandas==0.23.4
  Downloading https://files.pythonhosted.org/packages/e1/d8/feeb346d41f181e83fba45224ab14a8d8af019b48af742e047f3845d8cff/pandas-0.23.4-cp36-cp36m-manylinux1_x86_64.whl (8.9MB)
ERROR: google-colab 1.0.0 has requirement pandas~=0.24.0, but you'll have pandas 0.23.4 which is incompatible.
Installing collected packages: pandas
  Found existing installation: pandas 0.24.2
    Uninstalling pandas-0.24.2:
      Successfully uninstalled pandas-0.24.2
Successfully installed pandas-0.23.4


In [17]:
time_bins = pd.cut(persons['exercise_time'], 6)  # 6 equal-width bins
pd.crosstab(persons['weight'], time_bins)
# I also got the pandas error and had to work around it.
# Looking at this table, one might think my initial hypothesis was incorrect:
# exercise time does not seem to have an effect on weight (e.g. a person who
# exercises 250 to 300 minutes per week can weigh the same as a person who
# exercises 50 minutes or less).

| weight \ exercise_time | (-0.3, 50.0] | (50.0, 100.0] | (100.0, 150.0] | (150.0, 200.0] | (200.0, 250.0] | (250.0, 300.0] |
|---|---|---|---|---|---|---|
| 100 | 4 | 2 | 0 | 2 | 3 | 4 |
| 101 | 0 | 1 | 1 | 4 | 2 | 2 |
| 102 | 1 | 2 | 2 | 1 | 4 | 3 |
| 103 | 1 | 2 | 0 | 2 | 0 | 3 |
| 104 | 2 | 2 | 1 | 0 | 3 | 1 |
| 105 | 3 | 0 | 0 | 3 | 1 | 3 |
| 106 | 0 | 1 | 2 | 3 | 0 | 2 |
| 107 | 0 | 1 | 3 | 3 | 2 | 2 |
| 108 | 0 | 3 | 3 | 3 | 5 | 5 |
| 109 | 2 | 2 | 3 | 4 | 2 | 0 |
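With one row per distinct weight, the table is still hard to read. Binning weight as well as exercise time gives a much smaller grid. The sketch below assumes the same column names as `persons.csv` but runs on a made-up stand-in so that it works without a download; `binned_crosstab` and `fake` are hypothetical names, and the relationship built into `fake` is an assumption, not a finding about the real data.

```python
import numpy as np
import pandas as pd

def binned_crosstab(df, row, col, row_bins=5, col_bins=6):
    """Crosstab of two numeric columns after cutting each into equal-width bins."""
    return pd.crosstab(pd.cut(df[row], row_bins), pd.cut(df[col], col_bins),
                       normalize='columns')

# Hypothetical stand-in with the same column names as persons.csv;
# swap in the real `persons` DataFrame when running the notebook
rng = np.random.default_rng(0)
fake = pd.DataFrame({'exercise_time': rng.integers(0, 301, size=1200)})
# Assumed (not measured) relationship, only so the table has a visible pattern
fake['weight'] = (220 - 0.3 * fake['exercise_time']
                  + rng.normal(0, 15, size=1200)).round().astype(int)

table = binned_crosstab(fake, 'weight', 'exercise_time')
print(table.round(2))
```

Calling `binned_crosstab(persons, 'weight', 'exercise_time')` on the real data would show whether the share of people in the heavier bins changes across the exercise-time bins.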


In [0]:
pd.crosstab(persons['weight'], time_bins, normalize='columns')
# Normalizing by column shows what share of each exercise-time bin falls at
# each weight. With one row per weight value, though, the percentages are
# not really any clearer than the raw counts.

### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  What are the variable types in the data?
2.  What are the relationships between the variables?
3.  Which relationships are "real", and which spurious?
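One way to approach question 3 is to compare a correlation computed on all rows with the same correlation computed inside strata of a suspected confounder. The sketch below is an illustration under stated assumptions: `correlation_within_strata` is a hypothetical helper, and the columns `x`, `y`, and `z` are synthetic, with `z` deliberately built to drive both `x` and `y`.

```python
import numpy as np
import pandas as pd

def correlation_within_strata(df, x, y, by, n_strata=3):
    """Correlation of x and y overall and within equal-count strata of `by`."""
    strata = pd.qcut(df[by], n_strata)
    result = {'overall': df[x].corr(df[y])}
    for label, group in df.groupby(strata, observed=True):
        result[str(label)] = group[x].corr(group[y])
    return pd.Series(result)

# Synthetic data: z influences both x and y; x has no direct effect on y
rng = np.random.default_rng(0)
z = rng.uniform(0, 1, size=3000)
toy = pd.DataFrame({
    'z': z,
    'x': z + rng.normal(0, 0.2, size=3000),
    'y': z + rng.normal(0, 0.2, size=3000),
})

result = correlation_within_strata(toy, 'x', 'y', by='z')
print(result.round(2))
```

If a correlation shrinks sharply once the stratifying variable is held roughly constant, that variable is a candidate confounder. For the assignment, something like `correlation_within_strata(persons, 'exercise_time', 'weight', by='age')` would be the analogous call, assuming those column names.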


## Stretch goals and resources

Following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub.

- [Spurious Correlations](http://tylervigen.com/spurious-correlations)
- [NIH on controlling for confounding variables](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017459/)

Stretch goals:

- Produce your own plot inspired by the Spurious Correlation visualizations (and consider writing a blog post about it - both the content and how you made it)
- Pick one of the techniques that NIH highlights for confounding variables - we'll be going into many of them later, but see if you can find which Python modules may help (hint - check scikit-learn)
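As a starting point for the second stretch goal, here is a minimal sketch of regression adjustment, one common way to control for a confounder. It uses plain NumPy least squares on synthetic data; scikit-learn's `LinearRegression` should give the same coefficients. The variables `z`, `x`, and `y` and the helper `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)              # hypothetical confounder
x = z + rng.normal(size=n)          # exposure, driven partly by z
y = 2.0 * z + rng.normal(size=n)    # outcome, driven by z only (true effect of x is 0)

def ols(columns, target):
    """Least-squares coefficients, with the intercept as the first entry."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

naive = ols([x], y)[1]         # slope of y on x, ignoring z
adjusted = ols([x, z], y)[1]   # slope of y on x, with z in the model
print(round(naive, 2), round(adjusted, 2))
```

The naive slope comes out near 1 even though `x` has no effect on `y` in this setup, while the slope with `z` included is close to 0.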