 ## Part A: Statistical Analysis

### Use the following information to complete tasks.

### Dataset description:

Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)?

Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the `Longbones.csv` dataset which you can find [here](https://github.com/LambdaSchool/data-science-practice-datasets/blob/main/unit_1/Longbones/Longbones.csv).

**What can we learn about the bodies that were buried in the cemetery?**

The variable names are:
* Site = Site ID, either Site 1 or Site 2
* Time = Interrment time in years
* Depth = Burial depth in ft.
* Lime = Burial with Quiklime (0 = No, 1 = Yes)
* Age = Age at time of death in years
* Nitro = Nitrogen composition of the long bones in g per 100g of bone.
* Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)


**Task 1** - Load the data

In [83]:
import pandas as pd
import numpy as np

data_url = 'https://raw.githubusercontent.com/pixeltests/datasets/main/Longbones.csv'
df = pd.read_csv(data_url)

df.head()

Unnamed: 0,Site,Time,Depth,Lime,Age,Nitro,Oil
0,1,88.5,7.0,1,,3.88,1
1,1,88.5,,1,,4.0,1
2,1,85.2,7.0,1,,3.69,1
3,1,71.8,7.6,1,65.0,3.88,0
4,1,70.6,7.5,1,42.0,3.53,0


In [79]:
df.describe()

Unnamed: 0,Site,Time,Depth,Lime,Age,Nitro,Oil
count,42.0,42.0,41.0,42.0,35.0,42.0,42.0
mean,1.428571,56.942857,6.87439,0.404762,34.342857,3.787857,0.071429
std,0.50087,20.512413,1.471824,0.496796,9.923065,0.179179,0.260661
min,1.0,26.5,4.0,0.0,19.0,3.27,0.0
25%,1.0,36.95,6.0,0.0,27.0,3.6925,0.0
50%,1.0,56.4,7.0,0.0,34.0,3.835,0.0
75%,2.0,71.35,8.0,1.0,39.5,3.92,0.0
max,2.0,93.6,9.25,1.0,65.0,4.06,1.0


In [80]:
df.shape

(42, 7)

**Task 2** - Missing data

Now, let's determine if there is any missing data in the dataset. If there is, drop the row that contains a missing value.

In [81]:
sum_null = df.isnull().sum().sum()
sum_null
df = df.dropna(inplace=True)

In [84]:
df.value_counts()

Site  Time  Depth  Lime  Age   Nitro  Oil
1     29.0  7.50   0     31.0  3.92   0      1
2     49.6  9.00   0     50.0  3.85   0      1
      27.6  6.00   0     22.0  4.00   0      1
      32.0  9.00   0     24.0  3.85   0      1
      32.2  9.00   0     27.0  3.85   0      1
      34.7  8.50   0     30.0  4.04   0      1
      35.7  9.00   0     19.0  3.93   0      1
      38.3  7.00   0     21.0  3.73   0      1
      59.6  9.25   0     46.0  3.72   0      1
1     71.8  7.60   1     65.0  3.88   0      1
2     64.7  5.00   1     27.0  3.90   0      1
            5.50   1     35.0  3.91   0      1
      67.4  4.50   1     39.0  3.66   0      1
      79.7  4.75   1     47.0  3.27   0      1
      88.0  5.50   1     26.0  3.43   0      1
      90.0  4.00   1     43.0  3.57   0      1
      26.5  7.00   0     34.0  4.06   0      1
1     71.6  8.00   1     35.0  3.88   0      1
      35.3  8.50   0     39.0  3.79   0      1
      46.5  6.50   0     35.0  3.69   0      1
      36.3  6.50  

### Use the following information to complete tasks 3 - 8

The mean nitrogen composition in living individuals is **4.3g per 100g of bone**.  

We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans).

**Task 3 -** Statistical hypotheses

From the list of choices below, select the null and alternative hypotheses using the experiment information described above.

A: Ho: There is no association between the nitrogen composition of living and non-living bones vs. Ha: There is an association between the nitrogen composition of living and non-living bones.

B: $H_0: \mu = 4.3$ vs. $H_a: \mu \neq 4.3$

C: $H_0: \mu_{living} \neq \mu_{dead}$ vs. $H_a: \mu_{living} = \mu_{dead}$

D: $H_0: \mu_{living} = \mu_{dead}$ vs. $H_a: \mu_{living} \neq \mu_{dead}$

In [85]:
Answer = 'B'

**Task 4 -** Statistical distributions

From the list of choices below, select the appropriate statistical test for the study described above.

A: A two-sample t-test

B: A chi-square test

C: A one-sample t-test

D: A Bayesian test

In [86]:
Answer = 'C'

**Task 5** - Hypothesis testing

Use a built-in Python function to conduct the statistical test you identified earlier. The scipy stats module has been imported.

In [87]:
import scipy.stats as st

In [88]:
from scipy.stats import t

In [89]:
df.head(1)

Unnamed: 0,Site,Time,Depth,Lime,Age,Nitro,Oil
0,1,88.5,7.0,1,,3.88,1


In [90]:
from scipy import stats
t,p = stats.ttest_1samp(df['Nitro'],4.3)
print(t,p)

-18.523756974519692 1.5721226013800768e-21


**Task 6**

Select the correct conclusion at the 0.05 significance level from the list of choices below.

A: We reject the null hypothesis at the 0.05 significance level and conclude that the mean long bone nitrogen composition for skeletons is different than the mean long bone nitrogen composition in living individuals.

B: We fail to reject the null hypothesis at the 0.05 significance level and conclude that the mean long bone nitrogen composition for skeletons is different than the mean long bone nitrogen composition in living individuals.

C: We reject the null hypothesis at the 0.05 significance level and conclude that the mean long bone nitrogen composition for skeletons is the same as the mean long bone nitrogen composition in living individuals..

D: We fail to reject the null hypothesis at the 0.05 significance level and conclude that the mean long bone nitrogen composition for skeletons is the same as the mean long bone nitrogen composition in living individuals.


In [91]:
Answer = 'A'

**Task 7** - Confidence Interval

Calculate a 95% confidence interval for the mean nitrogen composition in the longbones of a deceased individual using the t.interval function.


In [92]:
from scipy.stats import t

t.interval('alpha' == 0.95,df=df.shape[1]-1,loc=df['Nitro'].mean(),scale=df['Nitro'].sem())


(3.787857142857143, 3.787857142857143)

In [93]:
mean_nitro = df['Nitro'].mean()
t_star = t.ppf(0.95,df=df.shape[1]-1)
se_nitro = df['Nitro'].sem()
l = mean_nitro = t_star*se_nitro
u = mean_nitro = t_star*se_nitro
print(l,u)

0.053724841138487335 0.053724841138487335


**Task 8**

Select the correct interpretation of the 95% confidenc interval from the statements below.

A: In 95% of samples, the mean longbone nitrogen composition in skeletons is between 3.73 and 3.86 grams per 100g of bone.

B: We are 95% confident that the population mean longbone nitrogen composition in skeletons is between 3.73 and 3.86 grams per 100g of bone.

C: We are 95% confident that the sample mean longbone nitrogen composition in skeletons is between 3.73 and 3.86 grams per 100g of bone.

D: We are 95% confident that the mean longbone nitrogen composition in skeletons is between 34.3 grams per 100g of bone.


In [94]:
Answer = 'B'

## Part B: A/B Testing

### Use the following information to complete tasks 9 - 18

### A/B Testing and Udacity

Udacity is an online learning platform geared toward tech professionals who want to develop skills in programming, data science, etc.  These classes are intensive - both for the students and instructors - and the learning experience is best when students are able to dedicate enough time to the classes and there is not a lot of student churn.

Udacity wished to determine if presenting potential students with a screen that would remind them of the time commitment involved in taking a class would decrease the enrollment of students who were unlikely to succeed in the class.

At the time of the experiment, when a student selected a course, she was taken to the course overview page and presented with two options: "start free trial", and "access course materials".

If the student clicked "start free trial", she was asked to enter her credit card information and was enrolled in a free trial for the paid version of the course (which would covert to a paid membership after 14 days).

If the student clicked "access course materials", she could view the videos and take the quizzes for free but could not access all the features of the course such as coaching.

*Credit*: [Udacity A/B testing final project example](https://https://www.udacity.com/course/ab-testing--ud257?irclickid=W0WQs22htxyLTIxwUx0Mo3YgUkEzM2Rn81NW2g0&irgwc=1&utm_source=affiliate&utm_medium=&aff=27795&utm_campaign=_khm68yp1xv02l1pj0mzy8__)

In [98]:
import pandas as pd
import numpy as np
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Udacity%20AB%20testing%20data/AB%20testing%20data.csv'
ABtest_ = pd.read_csv(data_url)

print(ABtest_.shape)
ABtest_.head()

(999, 10)


Unnamed: 0,Date,C-Pageviews,C-Clicks,C-Enrollments,C-Payments,E-Pageviews,E-Clicks,E-Enrollments,E-Payments,Unnamed: 9
0,"Sat, Oct 11",7723.0,687.0,134.0,70.0,7716.0,686.0,105.0,34.0,
1,"Sun, Oct 12",9102.0,779.0,147.0,70.0,9288.0,785.0,116.0,91.0,
2,"Mon, Oct 13",10511.0,909.0,167.0,95.0,10480.0,884.0,145.0,79.0,
3,"Tue, Oct 14",9871.0,836.0,156.0,105.0,9867.0,827.0,138.0,92.0,
4,"Wed, Oct 15",10014.0,837.0,163.0,64.0,9793.0,832.0,140.0,94.0,


In [100]:
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Udacity%20AB%20testing%20data/AB_test_payments.csv'

ABtest = pd.read_csv(data_url, skipinitialspace=True, header=0)

print(ABtest.shape)
ABtest.head()

(7208, 3)


Unnamed: 0,UserID,Group,Payment
0,0,Control,1
1,1,Control,1
2,2,Control,1
3,3,Control,1
4,4,Control,1


**Task 9 -** Statistical hypotheses

From the list of choices below, select the null and alternative hypotheses using the experiment information described above.

A: Ho: There is no association between the screen a customer viewed and whether the student became a paying customer vs. Ha: There is an association between the screen a customer viewed and whether the student became a paying customer.

B: Ho: There is an association between the screen a customer viewed and whether the student became a paying customer vs. Ha: There is no association between the screen a customer viewed and whether the student became a paying customer.

C: $H_0: \mu_{experiment} \neq \mu_{control}$ vs. $H_a: \mu_{experiment} = \mu_{control}$

D: $H_0: \mu_{experiment} = \mu_{control}$ vs. $H_a: \mu_{experiment} \neq \mu_{control}$

In [96]:
Answer = 'A'

**Task 10** - Frequency and relative frequency

Calculate the frequency and relative frequency of viewing the control version of the website and the experimental version of the website.

In [102]:
group_freq = ABtest['Group'].value_counts()
group_freq

Control       3785
Experiment    3423
Name: Group, dtype: int64

In [103]:
group_freq = ABtest['Group'].value_counts(normalize=True)
group_freq

Control       0.525111
Experiment    0.474889
Name: Group, dtype: float64

In [104]:
group_pct = group_freq*100
group_pct

Control       52.511099
Experiment    47.488901
Name: Group, dtype: float64

**Task 11** - Frequency and relative frequency

Calculate the frequency and relative frequency of converting to a paying customer.


In [106]:
ABtest.head()

Unnamed: 0,UserID,Group,Payment
0,0,Control,1
1,1,Control,1
2,2,Control,1
3,3,Control,1
4,4,Control,1


In [107]:
pay_freq = ABtest['Payment'].value_counts()
pay_freq

1    3978
0    3230
Name: Payment, dtype: int64

In [108]:
pay_freq = ABtest['Payment'].value_counts(normalize=True)
pay_freq

1    0.551887
0    0.448113
Name: Payment, dtype: float64

In [109]:
pay_pct = pay_freq * 100
pay_pct

1    55.188679
0    44.811321
Name: Payment, dtype: float64

**Task 12** - Joint distribution

Calculate the joint distribution of experimental condition and conversion to a paying customer.


In [112]:
joint_dist = pd.crosstab(index=ABtest['Group'],columns=ABtest['Payment'])
joint_dist

Payment,0,1
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
Control,1752,2033
Experiment,1478,1945


**Task 13** - Marginal distribution

Add the table margins to the joint distribution of experimental condition and conversion to a paying customer.

In [114]:
marginal_dist = pd.crosstab(index=ABtest['Group'],columns=ABtest['Payment'],margins=True)
marginal_dist

Payment,0,1,All
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Control,1752,2033,3785
Experiment,1478,1945,3423
All,3230,3978,7208


**Task 14 -** Conditional distribution

Calculate the distribution of payment conversion conditional on the text the individual saw when he or she was signing up for Udacity.

In [116]:
cond_dist = pd.crosstab(index=ABtest['Group'],columns=ABtest['Payment'],normalize=True,margins=True)*100
cond_dist

Payment,0,1,All
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Control,24.306326,28.204772,52.511099
Experiment,20.504994,26.983907,47.488901
All,44.811321,55.188679,100.0


**Task 15 -** Statistical distributions

Identify the appropriate statistical test to determine if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not he or she converted to a paying customer.

A: A two-sample t-test

B: A Bayesian test

C: A one-sample t-test

D: A chi-square test

In [117]:
Answer = 'D'

**Task 16** - Hypothesis testing

Conduct the hypothesis test you identified in Task 15.

In [120]:
from scipy.stats import chi2_contingency
x2_statistic,pvalue,dof,expctd = chi2_contingency(pd.crosstab(index=ABtest['Group'],columns=ABtest['Payment']))
print(x2_statistic)
print(pvalue)
print(dof)
print(expctd)

6.902249432727325
0.008608736615463934
1
[[1696.10849057 2088.89150943]
 [1533.89150943 1889.10849057]]


**Task 17**

Select the correct conclusion at the 0.05 significance level from the list of choices below.

A: We reject the null hypothesis at the 0.05 significance level and conclude that there is no association between the screen a student viewed and if the student became a paying customer.

B: We fail to reject the null hypothesis at the 0.05 significance level and conclude that there is no association between the screen a student viewed and if the student became a paying customer.

C: We reject the null hypothesis at the 0.05 significance level and conclude that there is a statistically significant association between the screen a student viewed and if the student became a paying customer.

D: We fail to reject the null hypothesis at the 0.05 significance level and conclude that there is a statistically significant association between the screen a student viewed and if the student became a paying customer.


In [None]:
Answer = 'C'

**Task 18** - Visualization

Draw a side-by-side barplot illustrating the distribution of conversion by experimental group.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(data=ABtest,x='Group',y='Payment',ci=None)

**Task 19** - Bayesian and Frequentist Statistics

In a few sentences, describe the difference between Bayesian and Frequentist statistics.

This task will not be autograded - but it is part of completing the challenge.

**Task 19 ANSWER:**

SHORT ANSWER HERE : The difference is that one of takes in P(Hypothesis|Data) whie the other is reciprocal.