# Familiar: Data Analysis for Marketing Strategies

Familiar is a startup in the new market of blood transfusion! They offer their subscribers two packages, called Vein and Artery. This study aims to answer the following questions:

- Is the average lifespan of a Vein Pack subscriber significantly different from the average life expectancy of 73 years?

- Is the average lifespan of a Vein Pack subscriber significantly different from the average lifespan of an Artery Pack subscriber?

- Is there a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level?

The company has provided two CSV files with data on its subscribers to answer these questions:

- familiar_lifespan.csv 
- familiar_iron.csv

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency

# Load datasets
lifespans = pd.read_csv('familiar_lifespan.csv')
iron = pd.read_csv('familiar_iron.csv')
print(lifespans.head())

     pack   lifespan
0    vein  76.255090
1  artery  76.404504
2  artery  75.952442
3  artery  76.923082
4  artery  73.771212


Based on this output, familiar_lifespan.csv has two columns: pack (vein or artery) and lifespan (e.g. 76.255090).

In [2]:
vein_pack_lifespans = lifespans.lifespan[lifespans.pack == 'vein']
avg_vein_pack_lifespans = np.mean(vein_pack_lifespans)
print("The average lifespan for subscribers of the Vein pack is {}".format(avg_vein_pack_lifespans))

The average lifespan for subscribers of the Vein pack is 76.16901335636044


Using this information, we can perform a one-sample t-test to see whether the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy of 73 years. We use a significance threshold of 0.05.

In [3]:
t_statistic, p_value = stats.ttest_1samp(vein_pack_lifespans, 73)
print("The P Value for this test is {}".format(p_value))

The P Value for this test is 5.972157921433201e-07


This is much smaller than 0.05, so *we conclude that the average lifespan of Vein Pack subscribers is significantly different from 73 years.*
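
To make the one-sample t-test less of a black box, the sketch below computes the t-statistic and two-sided p-value by hand and compares them with `stats.ttest_1samp`. The data here is synthetic (the names `sample` and `benchmark` and the random draw are assumptions for illustration, not the Familiar dataset), so only the agreement between the two computations is meaningful.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for vein_pack_lifespans (an assumption, NOT the Familiar data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=76.0, scale=4.0, size=20)
benchmark = 73  # life expectancy used in the study

# t = (sample mean - benchmark) / (sample standard deviation / sqrt(n))
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - benchmark) / standard_error

# Two-sided p-value from Student's t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual computation
t_scipy, p_scipy = stats.ttest_1samp(sample, benchmark)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```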

Now let's look at the other pack, Artery, specifically its mean lifespan.

In [4]:
artery_pack_lifespans = lifespans.lifespan[lifespans.pack == 'artery']
avg_artery_pack_lifespans = np.mean(artery_pack_lifespans)
print("The mean for the Artery pack is {}".format(avg_artery_pack_lifespans))

The mean for the Artery pack is 74.8736622351704


Now that we know this, we can ask whether the average lifespan of a Vein Pack subscriber is significantly different from the average lifespan of an Artery Pack subscriber. We will answer this using a two-sample t-test.

In [5]:
t_statistic, p_value = stats.ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print("The P Value for this test is {}".format(p_value))

The P Value for this test is 0.055888830790708194


Although the sample means for the Vein and Artery packs differ (about 76.17 vs. 74.87 years), the p-value is not below our significance threshold of 0.05, so *the difference is not statistically significant.*
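
Because the p-value sits close to the threshold, it may be worth checking how sensitive the result is to the test's assumptions. By default `stats.ttest_ind` assumes both groups have equal variances; passing `equal_var=False` runs Welch's t-test, which drops that assumption. The sketch below is not part of the original analysis and uses synthetic groups (`group_a` and `group_b` are hypothetical stand-ins), so the printed values say nothing about Familiar's subscribers.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two packs (an assumption, NOT the Familiar data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=76.0, scale=4.0, size=20)
group_b = rng.normal(loc=75.0, scale=6.0, size=30)

# Student's t-test: assumes equal variances (the scipy default)
_, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Student p-value:", p_student)
print("Welch p-value:  ", p_welch)
```

On the real data, the same comparison would be `stats.ttest_ind(vein_pack_lifespans, artery_pack_lifespans, equal_var=False)`.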

Now we shift our attention to iron levels in subscribers of both packs and whether any statistical association exists.

In [6]:
print(iron.head())
print(iron.iron.unique())

     pack    iron
0    vein     low
1  artery  normal
2  artery  normal
3  artery  normal
4  artery    high
['low' 'normal' 'high']


We use the familiar_iron.csv file for this analysis. It also has two columns: pack (artery or vein) and iron (iron level). The unique values in the iron column are ['low' 'normal' 'high'].

From this we can create a contingency table (cross-tabulation) of pack against iron level and then perform a chi-square test on it.

In [7]:
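# Count subscribers for each combination of pack and iron level
contingency_table = pd.crosstab(iron.pack, iron.iron)
print(contingency_table)
# Chi-square test of independence between pack and iron level
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print("The P Value for this test is {}".format(p_value))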

iron    high  low  normal
pack                     
artery    87   29      29
vein      20  140      40
The P Value for this test is 9.359749337433006e-25


This p-value lands well below our 0.05 significance threshold, which means *there is a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level.*
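
The test tells us an association exists but not what it looks like. One way to see it, sketched below, is to compare the observed counts with the `expected` counts that `chi2_contingency` returns, and to look at the share of each iron level within each pack. The counts are typed in from the contingency table printed above so the block runs without the CSV file; the names `observed`, `expected_table`, and `row_shares` are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table printed above
observed = pd.DataFrame(
    {"high": [87, 20], "low": [29, 140], "normal": [29, 40]},
    index=["artery", "vein"],
)

chi2, p_value, dof, expected = chi2_contingency(observed)

# Counts we would expect if pack and iron level were independent
expected_table = pd.DataFrame(expected, index=observed.index, columns=observed.columns)
print(expected_table.round(1))

# Share of each iron level within each pack (each row sums to 1)
row_shares = observed.div(observed.sum(axis=1), axis=0)
print(row_shares.round(2))
```

In the observed table, 87 of the 145 Artery subscribers have high iron, while 140 of the 200 Vein subscribers have low iron, which is the pattern driving the small p-value.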