### Familiar: A Study In Data Analysis

Welcome to Familiar, a startup in the new market of blood transfusion! You’ve joined the team because you appreciate the flexible hours and extremely intelligent team, but the overeager doorman welcoming you into the office is a nice way to start your workday (well, work-evening).

Familiar has fallen on tough times lately, so you’re hoping to help the team gain some insights about their product and move the needle (so to speak).


The Familiar team has provided us with some data on lifespans for subscribers to two different packages, the Vein Pack and the Artery Pack!

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Load datasets
lifespans = pd.read_csv('familiar_lifespan.csv')
iron = pd.read_csv('familiar_iron.csv')

In [2]:
lifespans.head()

Unnamed: 0,pack,lifespan
0,vein,76.25509
1,artery,76.404504
2,artery,75.952442
3,artery,76.923082
4,artery,73.771212


The first thing we want to know is whether Familiar’s most basic package, the Vein Pack, actually has a significant impact on the subscribers. It would be a marketing goldmine if we can show that subscribers to the Vein Pack live longer than other people.

Is the average lifespan of a Vein Pack subscriber longer than 73 years?

In [3]:
# Tasks 2-3: select Vein Pack subscribers and compute their average lifespan
vein_pack = lifespans[lifespans.pack == 'vein']
vein_pack_lifespans = vein_pack.lifespan
print("Average vein pack", np.mean(vein_pack_lifespans))

Average vein pack 76.16901335636044


We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy of 73 years.

We would use a one-sample t-test to test the following null and alternative hypotheses:

* Null: The average lifespan of a Vein Pack subscriber is 73 years.
* Alternative: The average lifespan of a Vein Pack subscriber is NOT 73 years.
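To make the mechanics of these hypotheses concrete, here is a minimal sketch of the one-sample t statistic computed by hand and compared with SciPy. The `sample` array is synthetic stand-in data (an assumption for illustration, not the Familiar dataset), and the names `mu0` and `t_manual` are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in for lifespan data (NOT the Familiar dataset)
rng = np.random.default_rng(0)
sample = rng.normal(loc=76, scale=4, size=20)
mu0 = 73  # hypothesized population mean under the null

# t = (sample mean - mu0) / (sample standard deviation / sqrt(n)),
# using the sample standard deviation (ddof=1)
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# SciPy computes the same statistic plus a two-sided p-value
t_scipy, p_scipy = ttest_1samp(sample, mu0)
print("manual t:", t_manual)
print("scipy t: ", t_scipy)
print("p-value: ", p_scipy)
```

The larger the gap between the sample mean and 73 relative to the standard error, the larger the t statistic and the smaller the p-value.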

In [4]:
# Tasks 4-5: one-sample t-test against the 73-year life expectancy
from scipy.stats import ttest_1samp
tstat, pval_vein = ttest_1samp(vein_pack_lifespans, 73)
print("P-value:", pval_vein)
print("Since the p-value is less than 0.05, we reject the null hypothesis: the average lifespan of Vein Pack subscribers differs significantly from 73 years, and the sample mean shows it is higher.")

P-value: 5.972157921433211e-07
Since the p-value is less than 0.05, we reject the null hypothesis: the average lifespan of Vein Pack subscribers differs significantly from 73 years, and the sample mean shows it is higher.
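The question as posed is directional (is the lifespan longer than 73 years?), while the test above is two-sided. A one-sided version can be sketched as follows, assuming a SciPy release recent enough for `ttest_1samp` to accept the `alternative` argument (1.6 or later). The data here is synthetic and only illustrates the call; it is not the Familiar dataset.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic stand-in for Vein Pack lifespans (NOT the Familiar dataset)
rng = np.random.default_rng(1)
sample = rng.normal(loc=76, scale=4, size=20)

# Two-sided test: is the mean different from 73?
t_stat, p_two = ttest_1samp(sample, 73)

# One-sided test: is the mean greater than 73?
_, p_greater = ttest_1samp(sample, 73, alternative="greater")

print("two-sided p-value:", p_two)
print("one-sided p-value:", p_greater)
```

When the t statistic is positive, the one-sided p-value is half the two-sided one, so a result that is significant in the two-sided test stays significant in the matching one-sided test.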


In order to differentiate Familiar’s different product lines, we’d like to compare this lifespan data between our different packages. Our next step up from the Vein Pack is the Artery Pack. 

Is the average lifespan for the Artery Pack longer than for the Vein Pack?

In [5]:
# Tasks 6-7: select Artery Pack subscribers and compute their average lifespan
artery_pack = lifespans[lifespans.pack == 'artery']
artery_pack_lifespans = artery_pack.lifespan
print("Average Artery pack", np.mean(artery_pack_lifespans))

Average Artery pack 74.87366223517039


We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy for the Artery Pack.

We would use a two-sample t-test to test the following null and alternative hypotheses:

* Null: The average lifespan of a Vein Pack subscriber is equal to the average lifespan of an Artery Pack subscriber.
* Alternative: The average lifespan of a Vein Pack subscriber is NOT equal to the average lifespan of an Artery Pack subscriber.

In [6]:
# Tasks 8-9: two-sample t-test comparing Vein Pack and Artery Pack lifespans
from scipy.stats import ttest_ind
tstat, pval_two = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print("P-value:", pval_two)
print("Since the p-value is above the 0.05 threshold, we fail to reject the null hypothesis: the difference in average lifespan between Vein Pack and Artery Pack subscribers is not statistically significant.")

P-value: 0.05588883079070819
Since the p-value is above the 0.05 threshold, we fail to reject the null hypothesis: the difference in average lifespan between Vein Pack and Artery Pack subscribers is not statistically significant.
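By default `ttest_ind` assumes both groups share the same variance. With a p-value this close to the threshold, it can be worth checking whether that assumption matters. The sketch below runs Welch’s t-test (`equal_var=False`) on synthetic data; the two arrays are made-up stand-ins, not the Familiar lifespans, so the printed numbers say nothing about the real packs.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins with different sizes and spreads (NOT the Familiar dataset)
rng = np.random.default_rng(2)
group_a = rng.normal(loc=76, scale=3, size=20)
group_b = rng.normal(loc=75, scale=5, size=30)

# Student's t-test: assumes equal variances in both groups
t_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the combined standard error
se = np.sqrt(group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

print("Student p-value:", p_student)
print("Welch p-value:  ", p_welch)
```

If the two p-values land on different sides of 0.05, the conclusion depends on the equal-variance assumption and deserves a closer look.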


The Familiar team has provided us with another dataset containing survey data about iron counts for our subscribers. This data has been pre-processed to categorize iron counts as “low”, “normal”, and “high” for each subscriber. Familiar wants to be able to advise potential subscribers about possible side effects of these packs and whether they differ for the Vein vs. the Artery pack.


Is there an association between the pack that a subscriber gets (Vein vs. Artery) and their iron level? 

In [7]:
iron.head()

Unnamed: 0,pack,iron
0,vein,low
1,artery,normal
2,artery,normal
3,artery,normal
4,artery,high


We’d like to find out if there is a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

We would use a Chi-Square test to test the following null and alternative hypotheses:

* Null: There is NOT an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.
* Alternative: There is an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

In [8]:
#11 Contingency table creation
Xtab = pd.crosstab(iron.pack, iron.iron)
print(Xtab)

iron    high  low  normal
pack                     
artery    87   29      29
vein      20  140      40
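Raw counts are hard to compare because the two packs have different numbers of subscribers (145 Artery, 200 Vein). A small sketch that rebuilds the table above from its printed counts and converts each row to proportions; the names `counts` and `row_props` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the contingency table printed above
counts = pd.DataFrame(
    {"high": [87, 20], "low": [29, 140], "normal": [29, 40]},
    index=pd.Index(["artery", "vein"], name="pack"),
)

# Divide each row by its total so every row sums to 1
row_props = counts.div(counts.sum(axis=1), axis=0)
print(row_props.round(2))
```

About 60% of Artery Pack subscribers fall in the high iron category, while about 70% of Vein Pack subscribers fall in the low category.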


In [9]:
# Chi-Square Test
from scipy.stats import chi2_contingency
chi2, pval_chi, dof, expected = chi2_contingency(Xtab)
print("P-value:", pval_chi)
print("Since the p-value is below the 0.05 threshold, we reject the null hypothesis: there is a significant association between the pack a subscriber gets (Vein vs. Artery) and their iron level.")

P-value: 9.359749337433008e-25
Since the p-value is below the 0.05 threshold, we reject the null hypothesis: there is a significant association between the pack a subscriber gets (Vein vs. Artery) and their iron level.
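A tiny p-value says the association is unlikely to be chance, but not how strong it is. One way to look further is to compare the observed counts with the counts expected under independence and to compute an effect size such as Cramér’s V. The sketch below uses the counts from the contingency table above; Cramér’s V is an addition for illustration and was not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above (rows: artery, vein; columns: high, low, normal)
observed = np.array([[87, 29, 29],
                     [20, 140, 40]])

chi2, pval, dof, expected = chi2_contingency(observed)

# Cramér's V = sqrt(chi2 / (n * (min(rows, columns) - 1))), ranging from 0 to 1
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print("Expected counts under independence:")
print(np.round(expected, 1))
print("Degrees of freedom:", dof)
print("Cramér's V:", round(cramers_v, 3))
```

The normal column matches its expected counts almost exactly, so the association is driven by the high and low iron categories.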
