# Sprint 2 Challenge

### Dataset description: 

Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)? 

Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the Longbones.csv dataset. 

You can find Longbones.csv here: https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv

**What can we learn about the bodies that were buried in the cemetery?**

The variable names are:

Site = Site ID, either Site 1 or Site 2

Time = Interment time in years

Depth = Burial depth in ft.

Lime = Burial with Quicklime (0 = No, 1 = Yes)

Age = Age at time of death in years

Nitro = Nitrogen composition of the long bones in g per 100g of bone.

Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)

Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.

### 1) Import the data

Import the Longbones.csv file and print the first 5 rows.

In [1]:
#Import the dataset

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)



In [2]:
### YOUR CODE HERE ###
df.head()

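|   | Site | Time | Depth | Lime | Age | Nitro | Oil |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 88.5 | 7.0 | 1 | NaN | 3.88 | 1 |
| 1 | 1 | 88.5 | NaN | 1 | NaN | 4.00 | 1 |
| 2 | 1 | 85.2 | 7.0 | 1 | NaN | 3.69 | 1 |
| 3 | 1 | 71.8 | 7.6 | 1 | 65.0 | 3.88 | 0 |
| 4 | 1 | 70.6 | 7.5 | 1 | 42.0 | 3.53 | 0 |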


### 2) Check for missing data.

Is there any missing data in the dataset?  If so, in what variable(s)?  

In [5]:
### YOUR CODE HERE ###
df.isnull().sum()

Site     0
Time     0
Depth    1
Lime     0
Age      7
Nitro    0
Oil      0
dtype: int64

Yes. `Depth` has 1 missing value and `Age` has 7 missing values; no other variable has missing data.
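To see which rows carry the missing values, rather than only the per-column counts, a boolean mask can help. The sketch below is illustrative only: it rebuilds the five rows shown in question 1 from inline text (the name `sample_csv` is an assumption made for this example) instead of downloading the full file.

```python
import io

import pandas as pd

# The first five rows as displayed in question 1; empty fields are missing values.
sample_csv = """Site,Time,Depth,Lime,Age,Nitro,Oil
1,88.5,7.0,1,,3.88,1
1,88.5,,1,,4.0,1
1,85.2,7.0,1,,3.69,1
1,71.8,7.6,1,65.0,3.88,0
1,70.6,7.5,1,42.0,3.53,0
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# True for every row that has at least one missing value.
has_missing = sample.isnull().any(axis=1)
rows_with_missing = sample[has_missing]

# Per-column counts, restricted to columns that actually have gaps.
missing_per_column = sample.isnull().sum()
print(rows_with_missing)
print(missing_per_column[missing_per_column > 0])
```

On these five rows the mask flags rows 0 to 2, which matches the `NaN` entries visible in the `df.head()` output.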

### 3) Remove any rows with missing data from the dataset.  If there is no missing data, write "No missing data" in the answer section below.

In [8]:
### YOUR CODE HERE ###
df = df.dropna()
df.isnull().sum()

Site     0
Time     0
Depth    0
Lime     0
Age      0
Nitro    0
Oil      0
dtype: int64

If there are no NA's, indicate that here. 
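`dropna()` with no arguments removes a row if any column is missing, even a column that a later analysis never uses. As a hedged alternative, not what the challenge asks for, `dropna(subset=...)` keeps every row that is complete in the columns of interest. The sketch reuses the five rows displayed in question 1; `sample_csv` is a name introduced only for this example.

```python
import io

import pandas as pd

# The first five rows as displayed in question 1; empty fields are missing values.
sample_csv = """Site,Time,Depth,Lime,Age,Nitro,Oil
1,88.5,7.0,1,,3.88,1
1,88.5,,1,,4.0,1
1,85.2,7.0,1,,3.69,1
1,71.8,7.6,1,65.0,3.88,0
1,70.6,7.5,1,42.0,3.53,0
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Drop a row if ANY column is missing (the approach used in this notebook).
complete_rows = sample.dropna()

# Drop a row only if the column needed for the analysis is missing.
nitro_rows = sample.dropna(subset=["Nitro"])

print(len(complete_rows), len(nitro_rows))
```

The trade-off is that the first approach gives one consistent set of rows for every question, while the second keeps more observations for analyses that only involve `Nitro`.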

# Use the following information to answer questions 4) - 7)

The mean nitrogen composition in living individuals is 4.3g per 100g of bone.

We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans). 



### 4) Using symbols and statistical language, write the null and alternative hypotheses outlined above.

$H_0: \mu = 4.3$

$H_a: \mu \neq 4.3$

### 5) What is the appropriate test for these hypotheses?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

A one-sample t-test is the appropriate test, because nitrogen composition is a quantitative variable and we are comparing its sample mean to a hypothesized population mean (4.3g per 100g of bone).

### 6) Use a built-in Python function to conduct the statistical test you identified in 5).  Report your p-value.  Write the conclusion to your hypothesis test at the alpha = 0.05 significance level.

In [11]:
### YOUR CODE HERE ###
from scipy import stats

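# One-sample t-test of H0: mean Nitro = 4.3.
# The result holds both the test statistic and the p-value.
nitro_test = stats.ttest_1samp(df['Nitro'], 4.3)

print(nitro_test)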

Ttest_1sampResult(statistic=-16.525765821830365, pvalue=8.097649978903554e-18)


Since the p-value of about $8.1 \times 10^{-18}$ is less than 0.05, we reject the null hypothesis and conclude that the mean nitrogen composition per 100g of bone in the deceased is not equal to that of living humans. The negative test statistic indicates that the sample mean is below 4.3g.
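For intuition about what the test statistic measures, the one-sample t statistic is the distance between the sample mean and the hypothesized mean, expressed in standard errors: $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below computes it by hand with the standard library. The five readings are hypothetical values chosen for illustration and are not taken from Longbones.csv.

```python
import math
import statistics

# Hypothetical nitrogen readings (g per 100 g of bone); NOT from Longbones.csv.
sample = [3.9, 3.7, 3.8, 3.6, 4.0]
null_mean = 4.3  # hypothesized population mean under H0

sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(len(sample))

# Number of standard errors between the sample mean and the hypothesized mean.
t_statistic = (sample_mean - null_mean) / standard_error
print(t_statistic)
```

A large negative value, as in the notebook's result, means the sample mean sits many standard errors below the hypothesized mean.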

### 7) Create a 95% confidence interval for the mean nitrogen composition in the long bones of a deceased individual.  Interpret your confidence interval in a sentence or two.

In [14]:
### YOUR CODE HERE ###
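from scipy.stats import t

# Sample statistics for the nitrogen measurements
nitro_mean = df['Nitro'].mean()
nitro_std = df['Nitro'].std()  # sample standard deviation (ddof=1)
nitro_n = df['Nitro'].count()
nitro_se = nitro_std / np.sqrt(nitro_n)  # standard error of the mean

# The confidence level is passed positionally because its keyword was renamed
# from `alpha` to `confidence` in newer SciPy releases.
nitro_cl = t.interval(0.95, df=nitro_n - 1, loc=nitro_mean, scale=nitro_se)
print(nitro_cl)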

(3.734020952024922, 3.8579790479750784)


We are 95% confident that the population mean nitrogen composition in the long bones of a deceased individual is between 3.73g and 3.86g per 100g of bone.
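A t-based interval is symmetric around the sample mean, so its midpoint and half-width can be recovered from the printed endpoints. The short sketch below does that and checks whether the hypothesized value of 4.3g falls inside the interval. The variable names are illustrative; the endpoints are copied from the output above.

```python
# Endpoints copied from the printed confidence interval above.
ci_low, ci_high = 3.734020952024922, 3.8579790479750784
null_mean = 4.3  # mean nitrogen composition in living individuals

# The interval is symmetric, so the midpoint is the sample mean
# and the half-width is the margin of error.
sample_mean = (ci_low + ci_high) / 2
margin_of_error = (ci_high - ci_low) / 2

null_inside_interval = ci_low <= null_mean <= ci_high
print(round(sample_mean, 3), round(margin_of_error, 3), null_inside_interval)
```

Because 4.3 lies outside the 95% interval, the interval agrees with the two-sided test in question 6, which rejected the null hypothesis at the 0.05 level.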

# Use the following information to answer questions 8) - 12)


The researchers also want to learn more about burial practices in the parts of England where the two cemeteries in the study were located.  They wish to determine if burials with Quicklime are associated with the burial region.

Their null hypothesis is that there is no association between cemetery site and burial with Quicklime.  The alternative hypothesis is that there is an association between cemetery site and burial with Quicklime.



### 8) Calculate the joint distribution of burial with Quicklime by burial site.

In [16]:
### YOUR CODE HERE ###
burial_ql_joint = pd.crosstab(df['Lime'], df['Site'], margins=True)
burial_ql_joint

| Lime \ Site | 1 | 2 | All |
|---|---|---|---|
| 0 | 14 | 9 | 23 |
| 1 | 5 | 7 | 12 |
| All | 19 | 16 | 35 |
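The table above reports counts. A joint distribution is often expressed as proportions of the grand total instead, which is what `pd.crosstab(..., normalize='all')` would return on the full data. As a minimal sketch, the same proportions can be derived from the four cell counts shown above; the `counts` frame is rebuilt by hand here purely for illustration.

```python
import pandas as pd

# Cell counts copied from the joint table above (rows: Lime, columns: Site).
counts = pd.DataFrame(
    {1: [14, 5], 2: [9, 7]},
    index=pd.Index([0, 1], name="Lime"),
)
counts.columns.name = "Site"

# Divide every cell by the grand total so that all cells sum to 1.
joint = counts / counts.to_numpy().sum()
print(joint.round(3))
```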


### 9) Calculate the conditional distribution of burial with Quicklime by burial site.

In [18]:
### YOUR CODE HERE ###
burial_ql_conditional = pd.crosstab(df['Lime'], df['Site'], normalize='index')*100
burial_ql_conditional

| Lime \ Site | 1 | 2 |
|---|---|---|
| 0 | 60.869565 | 39.130435 |
| 1 | 41.666667 | 58.333333 |
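With `normalize='index'`, each row of the table sums to 100, so it shows how burials in each Quicklime category are split across the two sites. If the question is instead read as the distribution of Quicklime use within each site, the percentages need to sum to 100 down each column, which on the full data would correspond to `normalize='columns'`. The sketch below shows that reading, computed from the counts in question 8; it is an alternative interpretation, not a replacement for the output above.

```python
import pandas as pd

# Cell counts copied from the joint table in question 8 (rows: Lime, columns: Site).
counts = pd.DataFrame(
    {1: [14, 5], 2: [9, 7]},
    index=pd.Index([0, 1], name="Lime"),
)
counts.columns.name = "Site"

# Divide each column by its site total so each column sums to 100%.
by_site = counts / counts.sum(axis=0) * 100
print(by_site.round(2))
```

Under this reading, the share of burials with Quicklime can be compared directly between Site 1 and Site 2.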


### 10) What is the appropriate test for the hypotheses listed above?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

A chi-square test of independence is the appropriate test, because cemetery site and burial with Quicklime are both categorical variables and the test assesses whether two categorical variables are associated.

### 11) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [21]:
### YOUR CODE HERE ###
from scipy.stats import chi2_contingency

g, p_val, dof, expctd = chi2_contingency(pd.crosstab(df['Lime'], df['Site']))

print(p_val)

0.4684181967877057


Since the p-value of 0.468 is greater than 0.05, we fail to reject the null hypothesis: the data do not provide sufficient evidence of an association between cemetery site and burial with Quicklime.
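For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), which is why the reported p-value is larger than an uncorrected chi-square test would give. The sketch below reproduces the calculation by hand from the counts in question 8, using only the standard library. It relies on the fact that, with 1 degree of freedom, the chi-square upper-tail probability equals `erfc(sqrt(x / 2))`. The helper names are illustrative.

```python
import math

# Observed counts from question 8 (rows: Lime 0/1, columns: Site 1/2).
observed = [[14, 9], [5, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(correction):
    """Chi-square statistic, optionally with Yates' continuity correction."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if correction:
                diff = max(diff - 0.5, 0.0)  # shrink each deviation by 0.5
            total += diff ** 2 / e
    return total


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


p_corrected = p_value_df1(chi_square(correction=True))
p_uncorrected = p_value_df1(chi_square(correction=False))
print(p_corrected, p_uncorrected)
```

Both p-values are well above 0.05, so the conclusion does not depend on whether the correction is applied. All four expected counts exceed 5, which is the usual rule of thumb for the chi-square approximation.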

### 12) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [22]:
### YOUR CODE HERE ###
g, p_val, dof, expctd = chi2_contingency(pd.crosstab(df['Lime'], df['Site']))

print(p_val)

0.4684181967877057


Since the p-value of 0.468 is greater than 0.05, we fail to reject the null hypothesis: the data do not provide sufficient evidence of an association between cemetery site and burial with Quicklime.

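### 13) In a few sentences, describe the difference between Bayesian and Frequentist statistics.

Frequentist statistics treats a parameter as a fixed, unknown quantity and draws conclusions from the observed data alone, interpreting probability as the long-run frequency of an outcome over repeated sampling. Bayesian statistics treats probability as a degree of belief: it combines a prior belief about the parameter with the observed data to produce an updated (posterior) belief.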
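As a small illustration of the contrast, consider estimating the proportion of burials that used Quicklime, using the 12 of 35 burials from the joint table in question 8. The frequentist point estimate is the sample proportion. A Bayesian estimate starts from a prior and updates it with the data. The uniform Beta(1, 1) prior below is an assumption chosen for this sketch, not something specified in the challenge.

```python
# Counts from the joint table in question 8: 12 of 35 burials used Quicklime.
limed, total = 12, 35

# Frequentist point estimate: the observed sample proportion.
freq_estimate = limed / total

# Bayesian estimate with an ASSUMED uniform Beta(1, 1) prior.
# A Beta prior combined with binomial data gives a Beta posterior:
# add successes to the first parameter and failures to the second.
prior_a, prior_b = 1, 1
post_a = prior_a + limed
post_b = prior_b + (total - limed)
posterior_mean = post_a / (post_a + post_b)

print(round(freq_estimate, 4), round(posterior_mean, 4))
```

With a weak prior and 35 observations the two estimates are close; a more informative prior, or fewer observations, would pull the Bayesian estimate further toward the prior belief.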